# TikTok Claims — Exploratory Data Analysis

**Owner:** Katherine Ygbuhay  
**Updated:** September 2025  
**Stage:** 02 

**Goal:**  
Perform structured exploratory data analysis (EDA) to profile the dataset, assess quality, and surface distributional patterns or imbalances that may affect modeling.  

**Contents:**  
- Dataset structure and variable types  
- Missing values and data quality checks  
- Standardization of categorical fields  
- Frequency and balance checks across claim and author ban status  
- Distributional review of engagement counts (skewness, outliers, variance)

## Dataset Overview and Schema

In [1]:
# Core packages
import pandas as pd
import numpy as np

# Plotly renderer setup for consistent inline visuals in JupyterLab
import plotly.io as pio
pio.renderers.default = "jupyterlab"  # fallback: "png" if running outside Lab

# Improve readability of numeric outputs
pd.options.display.float_format = '{:.3f}'.format

In [2]:
# Resolve the case-study root so paths work from any launch directory
from pathlib import Path

def find_case_root(start: Path | None = None) -> Path:
    p = start or Path.cwd()
    for q in [p, *p.parents]:
        if (q / "notebooks").exists() and (q / "data").exists():
            return q
    return p  # fallback

CASE_ROOT = find_case_root()
DATA_FILE = CASE_ROOT / "data" / "raw" / "tiktok_dataset.csv"
assert DATA_FILE.exists(), f"Missing data file: {DATA_FILE}"

In [3]:
# Load dataset into dataframe
df = pd.read_csv(DATA_FILE)

In [4]:
# --- Basic overview: shape, columns, and a small preview --------------------
# Shape: quick sense of dataset size (rows, columns)
print("Shape:", df.shape)

# Columns: ordered list of field names to orient yourself
print("Columns:", list(df.columns))

# Preview: first 10 rows to eyeball values, formats, and obvious anomalies
display(df.head(10))

Shape: (19382, 12)
Columns: ['#', 'claim_status', 'video_id', 'video_duration_sec', 'video_transcription_text', 'verified_status', 'author_ban_status', 'video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']


| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [5]:
# --- Schema & missingness: tidy table vs. df.info() text dump ---------------
# Build a small DataFrame summarizing each column's dtype and missingness.
# This renders cleanly in notebooks and is easier to scan/sort than raw text.

schema = (
    df.dtypes.rename("dtype").to_frame()             # column -> dtype
      .assign(
          non_null = df.notna().sum(),               # count of non-missing
          missing  = df.isna().sum(),                # count of missing
          missing_pct = lambda t: 100 * t["missing"] / len(df)  # % missing
      )
      .reset_index()
      .rename(columns={"index": "column"})
      .sort_values("column")
)

# Pretty formatting for percent column; keeps numeric types for others
display(schema.style.format({"missing_pct": "{:.2f}%"}))

| | column | dtype | non_null | missing | missing_pct |
|---|---|---|---|---|---|
| 0 | # | int64 | 19382 | 0 | 0.00% |
| 6 | author_ban_status | object | 19382 | 0 | 0.00% |
| 1 | claim_status | object | 19084 | 298 | 1.54% |
| 5 | verified_status | object | 19382 | 0 | 0.00% |
| 11 | video_comment_count | float64 | 19084 | 298 | 1.54% |
| 10 | video_download_count | float64 | 19084 | 298 | 1.54% |
| 3 | video_duration_sec | int64 | 19382 | 0 | 0.00% |
| 2 | video_id | int64 | 19382 | 0 | 0.00% |
| 8 | video_like_count | float64 | 19084 | 298 | 1.54% |
| 9 | video_share_count | float64 | 19084 | 298 | 1.54% |


In [6]:
# --- Numeric summary: go beyond describe() with median, skew, missing% ------
# We start from describe(), transpose so rows = features, and add useful stats.

# Select numeric-only to avoid warnings in skew/describe
num_cols = df.select_dtypes(include="number")

# If there are no numeric columns, skip gracefully
if num_cols.shape[1] == 0:
    print("No numeric columns to summarize.")
else:
    num_summary = (
        num_cols.describe(percentiles=[0.25, 0.5, 0.75]).T
          .rename(columns={"50%": "median"})            # clearer than '50%'
          .assign(
              missing = num_cols.isna().sum(),          # missing counts
              missing_pct = lambda t: 100 * t["missing"] / len(df),
              skew = num_cols.skew(numeric_only=True)   # distribution asymmetry
          )
          # Order columns for stakeholder readability
          .loc[:, ["count","mean","std","min","25%","median","75%","max","skew","missing_pct"]]
          .sort_index()
    )

    # Format skew and missing% to two decimals; leave others as-is
    display(num_summary.style.format({"skew": "{:.2f}", "missing_pct": "{:.2f}%"}))

| | count | mean | std | min | 25% | median | 75% | max | skew | missing_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| # | 19382.0 | 9691.5 | 5595.245794 | 1.0 | 4846.25 | 9691.5 | 14536.75 | 19382.0 | 0.0 | 0.00% |
| video_comment_count | 19084.0 | 349.312146 | 799.638865 | 0.0 | 1.0 | 9.0 | 292.0 | 9599.0 | 3.89 | 1.54% |
| video_download_count | 19084.0 | 1049.429627 | 2004.299894 | 0.0 | 7.0 | 46.0 | 1156.25 | 14994.0 | 2.74 | 1.54% |
| video_duration_sec | 19382.0 | 32.421732 | 16.229967 | 5.0 | 18.0 | 32.0 | 47.0 | 60.0 | 0.0 | 0.00% |
| video_id | 19382.0 | 5627454067.339129 | 2536440464.169367 | 1234959018.0 | 3430416807.25 | 5618663579.0 | 7843960211.25 | 9999873075.0 | 0.0 | 0.00% |
| video_like_count | 19084.0 | 84304.63603 | 133420.546814 | 0.0 | 810.75 | 3403.5 | 125020.0 | 657830.0 | 1.79 | 1.54% |
| video_share_count | 19084.0 | 16735.248323 | 32036.17435 | 0.0 | 115.0 | 717.0 | 18222.0 | 256130.0 | 2.72 | 1.54% |
| video_view_count | 19084.0 | 254708.558688 | 322893.280814 | 20.0 | 4942.5 | 9954.5 | 504327.0 | 999817.0 | 0.93 | 1.54% |


In [7]:
# --- Target label distribution: class balance sanity check ------------------
# Replace 'claim_status' with your actual label column name if different.

label_col = "claim_status"

if label_col in df.columns:
    label_counts = (
        df[label_col]
          .value_counts(dropna=False)                      # include NaN if present
          .to_frame("count")
          .assign(pct=lambda s: (s["count"] / s["count"].sum() * 100).round(2))
    )
    display(label_counts)
else:
    print(f"Label column '{label_col}' not found; skipping class balance table.")

| claim_status | count | pct |
|---|---|---|
| claim | 9608 | 49.57 |
| opinion | 9476 | 48.89 |
| NaN | 298 | 1.54 |


### Data Inspection Summary

- **Dataset structure:** The dataframe contains 19,382 observations of TikTok videos, each with categorical, text, and numerical fields describing claims, opinions, and metadata.  
- **Variable types:** Columns include five `float64`, three `int64`, and four `object` datatypes.  
- **Missing values:** *Claim status* and the five engagement count variables are each missing 298 values (1.54%); *video transcription* also contains nulls.  
- **Distributional checks:** The engagement count variables show long right tails and potential outliers. All five have positive skew (0.93 to 3.89), and their maximum values are roughly 100 to 1,000 times their medians, which inflates the standard deviations.  
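
The tail heaviness noted above can be quantified directly from the quartiles in the numeric summary table. The sketch below copies those quartiles and maxima by hand (it does not read the dataset) and compares each maximum with the median, the interquartile range, and Tukey's upper fence (Q3 + 1.5 × IQR). The fence is a conventional rule of thumb chosen here for illustration, not a method the notebook itself applies.

```python
import pandas as pd

# Quartiles and maxima copied by hand from the numeric summary table above.
summary = pd.DataFrame(
    {
        "q1":     [1.0, 7.0, 810.75, 115.0, 4942.5],
        "median": [9.0, 46.0, 3403.5, 717.0, 9954.5],
        "q3":     [292.0, 1156.25, 125020.0, 18222.0, 504327.0],
        "max":    [9599.0, 14994.0, 657830.0, 256130.0, 999817.0],
    },
    index=[
        "video_comment_count",
        "video_download_count",
        "video_like_count",
        "video_share_count",
        "video_view_count",
    ],
)

iqr = summary["q3"] - summary["q1"]

tail = pd.DataFrame(
    {
        "max_over_median": summary["max"] / summary["median"],
        "max_over_iqr": summary["max"] / iqr,
        # Tukey's upper fence: values above it are conventionally flagged as outliers
        "upper_fence": summary["q3"] + 1.5 * iqr,
    }
)
tail["max_exceeds_fence"] = summary["max"] > tail["upper_fence"]

print(tail.round(1))
```

On these figures the maximum lies above the upper fence for every count except `video_view_count`, whose very wide interquartile range puts the fence above the observed maximum.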

## Exploratory Data Analysis: Claims, Engagement, and Author Status

In [8]:
# Distribution of claim vs opinion labels
df['claim_status'].value_counts()

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

- The dataset is nearly balanced between *claim* (9,608) and *opinion* (9,476) labels, about 50.3% vs. 49.7% of labeled rows.

In [9]:
# Drop unlabeled rows before modeling
df = df[df['claim_status'].notna()].copy()

- Note: 298 rows (1.54%) lack a `claim_status` label. These observations are excluded so that modeling uses only labeled data.
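
The schema table reports the same missing count (298) for `claim_status` and for each engagement count, which suggests, but does not prove, that the gaps sit in the same rows. A minimal sketch of how that could be checked before dropping rows, using a hypothetical toy frame (`toy`) in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (illustration only).
toy = pd.DataFrame(
    {
        "claim_status": ["claim", "opinion", np.nan, "claim", np.nan],
        "video_view_count": [100.0, 20.0, np.nan, 350.0, np.nan],
        "video_like_count": [10.0, 2.0, np.nan, 40.0, np.nan],
    }
)

count_cols = ["video_view_count", "video_like_count"]
unlabeled = toy["claim_status"].isna()

# True if every unlabeled row is also missing all of its count fields
all_counts_missing = toy.loc[unlabeled, count_cols].isna().all().all()

# True if no labeled row is missing any count field
labeled_complete = toy.loc[~unlabeled, count_cols].notna().all().all()

# Dropping unlabeled rows then removes every missing count as well
labeled = toy.dropna(subset=["claim_status"]).copy()

print(all_counts_missing, labeled_complete, len(labeled))
```

If both checks hold on the real data, dropping unlabeled rows also clears the missing engagement counts, and no separate imputation step is needed for those columns.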

In [10]:
# Average and median view counts for claim vs opinion videos
claims = df[df['claim_status'] == 'claim']
opinions = df[df['claim_status'] == 'opinion']

print("Claims — mean:", claims['video_view_count'].mean(), 
      "median:", claims['video_view_count'].median())
print("Opinions — mean:", opinions['video_view_count'].mean(), 
      "median:", opinions['video_view_count'].median())

Claims — mean: 501029.4527477102 median: 501555.0
Opinions — mean: 4956.43224989447 median: 4953.0


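- Within each label, mean and median view counts are close, which suggests that neither group's view distribution is strongly skewed.  
- However, claim videos attract far higher view counts than opinion videos: the claim median is about 100 times the opinion median (501,555 vs. 4,953).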

In [11]:
# Normalize author ban status labels for consistency
df['author_ban_status'] = (
    df['author_ban_status']
    .str.strip().str.lower()
    .replace({'under  review': 'under review'})  # collapse double spaces
)

# Cross-tabulation of claim status and author ban status
df.groupby(['claim_status', 'author_ban_status']).size().unstack(fill_value=0)

| claim_status | active | banned | under review |
|---|---|---|---|
| claim | 6566 | 1439 | 1603 |
| opinion | 8817 | 196 | 463 |


- Among banned authors, claim videos far outnumber opinion videos (1,439 vs. 196), and the same holds for authors under review (1,603 vs. 463).  
- This could reflect stricter moderation for claims, but causation cannot be inferred:  
  - Claim content may inherently invite more scrutiny.  
  - Or authors posting claims may also post other content that violates terms.  
- Important: we cannot conclude that specific videos caused a ban.
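
Because the two labels have slightly different totals, row percentages make the cross-tabulation easier to compare. The sketch below rebuilds the table from the counts shown above (typed in by hand, not read from the dataset) and converts each row to percentages.

```python
import pandas as pd

# Counts copied by hand from the cross-tabulation above.
counts = pd.DataFrame(
    {
        "active": [6566, 8817],
        "banned": [1439, 196],
        "under review": [1603, 463],
    },
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Share of each ban status within each claim_status row, in percent
row_pct = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)

print(row_pct)
```

About 15.0% of claim videos come from banned authors, compared with about 2.1% of opinion videos. On the raw dataframe, `pd.crosstab(df['claim_status'], df['author_ban_status'], normalize='index')` should give the same proportions in one call.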

In [12]:
# Engagement metrics (mean/median) by author ban status
df.groupby('author_ban_status').agg({
    'video_view_count': ['mean', 'median'],
    'video_like_count': ['mean', 'median'],
    'video_share_count': ['mean', 'median']
})

| author_ban_status | video_view_count (mean) | video_view_count (median) | video_like_count (mean) | video_like_count (median) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|
| active | 215927.04 | 8616.0 | 71036.534 | 2222.0 | 14111.466 | 437.0 |
| banned | 445845.439 | 448201.0 | 153017.237 | 105573.0 | 29998.943 | 14468.0 |
| under review | 392204.836 | 365245.5 | 128718.05 | 71204.5 | 25774.697 | 9444.0 |


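- Videos from banned and under-review authors tend to have substantially higher engagement (views, likes, shares) than videos from active authors.  
- For active authors, means far exceed medians (e.g., 215,927 vs. 8,616 views), pointing to a long right tail. For banned authors the gap is smaller, and the mean view count (445,845) is slightly below the median (448,201).  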

In [13]:
# Engagement metrics (count, mean, median) by author ban status
df.groupby('author_ban_status').agg({
    'video_view_count': ['count', 'mean', 'median'],
    'video_like_count': ['count', 'mean', 'median'],
    'video_share_count': ['count', 'mean', 'median']
})

| author_ban_status | video_view_count (count) | video_view_count (mean) | video_view_count (median) | video_like_count (count) | video_like_count (mean) | video_like_count (median) | video_share_count (count) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|---|---|---|
| active | 15383 | 215927.04 | 8616.0 | 15383 | 71036.534 | 2222.0 | 15383 | 14111.466 | 437.0 |
| banned | 1635 | 445845.439 | 448201.0 | 1635 | 153017.237 | 105573.0 | 1635 | 29998.943 | 14468.0 |
| under review | 2066 | 392204.836 | 365245.5 | 2066 | 128718.05 | 71204.5 | 2066 | 25774.697 | 9444.0 |


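- Videos from banned and under-review authors receive far more engagement than videos from active authors.  
- Count columns provide context: group sizes differ widely (15,383 active, 1,635 banned, 2,066 under review).  
- The gap between mean and median is largest for active authors, where a minority of high-engagement videos pulls the averages up.  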
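
The dict-style `agg` calls above produce two-level column headers, which are awkward to sort and reference. A minimal sketch of pandas named aggregation as an alternative, run on a hypothetical toy frame (`toy`) rather than the real data; the output column names (`n`, `view_mean`, `view_median`, `mean_over_median`) are illustrative choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical toy data (illustration only, not taken from the dataset).
toy = pd.DataFrame(
    {
        "author_ban_status": ["active", "active", "active", "banned", "banned"],
        "video_view_count": [10.0, 20.0, 600.0, 400.0, 500.0],
    }
)

# Named aggregation yields flat, self-describing column names
flat = toy.groupby("author_ban_status").agg(
    n=("video_view_count", "count"),
    view_mean=("video_view_count", "mean"),
    view_median=("video_view_count", "median"),
)

# A ratio well above 1 indicates a mean pulled up by large values
flat["mean_over_median"] = flat["view_mean"] / flat["view_median"]

print(flat)
```

A single `mean_over_median` column makes the skew comparison across groups explicit instead of leaving it to visual inspection of two columns.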

In [14]:
# Median share counts by author ban status
df.groupby('author_ban_status')['video_share_count'].median()

author_ban_status
active           437.000
banned         14468.000
under review    9444.000
Name: video_share_count, dtype: float64

- Banned authors’ videos have a median share count about 33× that of active authors (14,468 vs. 437).  
- The data cannot show why. The gap is consistent with widely shared videos attracting more moderation attention, but it does not establish that.  
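
The multiplier quoted above follows directly from the medians in the output. A short sketch of the arithmetic, with the three medians typed in by hand:

```python
# Median share counts copied from the output above.
medians = {"active": 437.0, "banned": 14468.0, "under review": 9444.0}

# Each group's median expressed as a multiple of the active-author median
ratios = {status: value / medians["active"] for status, value in medians.items()}

for status, ratio in ratios.items():
    print(f"{status}: {ratio:.1f}x")
```

This gives about 33.1× for banned authors and about 21.6× for authors under review.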

In [15]:
# Per-view engagement ratios (safe against zero views)
denom = df['video_view_count'].replace(0, np.nan)

df['likes_per_view']    = df['video_like_count']    / denom
df['comments_per_view'] = df['video_comment_count'] / denom
df['shares_per_view']   = df['video_share_count']   / denom

# Replace NaNs from divide-by-zero with 0
df[['likes_per_view','comments_per_view','shares_per_view']] = (
    df[['likes_per_view','comments_per_view','shares_per_view']].fillna(0)
)

# Summary stats by claim and author ban status
df.groupby(['claim_status', 'author_ban_status']).agg({
    'likes_per_view': ['mean', 'median'],
    'comments_per_view': ['mean', 'median'],
    'shares_per_view': ['mean', 'median']
})

| claim_status | author_ban_status | likes_per_view (mean) | likes_per_view (median) | comments_per_view (mean) | comments_per_view (median) | shares_per_view (mean) | shares_per_view (median) |
|---|---|---|---|---|---|---|---|
| claim | active | 0.33 | 0.327 | 0.001 | 0.001 | 0.065 | 0.049 |
| claim | banned | 0.345 | 0.359 | 0.001 | 0.001 | 0.068 | 0.052 |
| claim | under review | 0.328 | 0.321 | 0.001 | 0.001 | 0.066 | 0.05 |
| opinion | active | 0.22 | 0.218 | 0.001 | 0.0 | 0.044 | 0.032 |
| opinion | banned | 0.207 | 0.198 | 0.0 | 0.0 | 0.041 | 0.031 |
| opinion | under review | 0.226 | 0.228 | 0.001 | 0.0 | 0.044 | 0.035 |


- Claim videos show higher per-view engagement rates than opinion videos.  
- Within claims, banned authors’ videos have slightly higher rates than others.  
- Within opinions, active/under-review authors outperform banned authors.  

In [16]:
# Sanity check: ensure no NaN/inf values remain in per-view ratios
bad = df[['likes_per_view','comments_per_view','shares_per_view']].replace([np.inf,-np.inf], np.nan)
assert bad.isna().sum().sum() == 0, "Found NaN/inf in per-view ratios"
print("Per-view ratios look clean ✅")

Per-view ratios look clean ✅


In [17]:
# Per-view engagement ratios with counts (sample size context)
df.groupby(['claim_status', 'author_ban_status']).agg({
    'likes_per_view': ['count', 'mean', 'median'],
    'comments_per_view': ['count', 'mean', 'median'],
    'shares_per_view': ['count', 'mean', 'median']
})

| claim_status | author_ban_status | likes_per_view (count) | likes_per_view (mean) | likes_per_view (median) | comments_per_view (count) | comments_per_view (mean) | comments_per_view (median) | shares_per_view (count) | shares_per_view (mean) | shares_per_view (median) |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | active | 6566 | 0.33 | 0.327 | 6566 | 0.001 | 0.001 | 6566 | 0.065 | 0.049 |
| claim | banned | 1439 | 0.345 | 0.359 | 1439 | 0.001 | 0.001 | 1439 | 0.068 | 0.052 |
| claim | under review | 1603 | 0.328 | 0.321 | 1603 | 0.001 | 0.001 | 1603 | 0.066 | 0.05 |
| opinion | active | 8817 | 0.22 | 0.218 | 8817 | 0.001 | 0.0 | 8817 | 0.044 | 0.032 |
| opinion | banned | 196 | 0.207 | 0.198 | 196 | 0.0 | 0.0 | 196 | 0.041 | 0.031 |
| opinion | under review | 463 | 0.226 | 0.228 | 463 | 0.001 | 0.0 | 463 | 0.044 | 0.035 |


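- Adding counts shows that most groups are large, but opinion videos from banned authors (196) and under-review authors (463) form comparatively small groups, so their estimates are less stable.  
- The counts do not change the earlier finding: per-view engagement differs more by claim status than by ban status.  
- Right skew is visible in shares per view (mean > median in every group); likes per view is roughly symmetric, with mean and median nearly equal.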
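
One way to make the comparison between claim status and ban status concrete is to set the gap between labels against the spread across ban statuses within each label. The sketch below does this for mean likes per view, using the counts and means from the table above typed in by hand. Because those means are rounded to three decimals, the results are approximate.

```python
import pandas as pd

# Group sizes and mean likes_per_view copied from the table above (rounded values).
cells = pd.DataFrame(
    {
        "claim_status": ["claim"] * 3 + ["opinion"] * 3,
        "author_ban_status": ["active", "banned", "under review"] * 2,
        "n": [6566, 1439, 1603, 8817, 196, 463],
        "likes_per_view_mean": [0.330, 0.345, 0.328, 0.220, 0.207, 0.226],
    }
)

# Spread across ban statuses within each label (max minus min of group means)
spread_within = cells.groupby("claim_status")["likes_per_view_mean"].agg(
    lambda s: s.max() - s.min()
)

# Count-weighted mean per label, then the gap between the two labels
sums = (
    cells.assign(weighted=cells["n"] * cells["likes_per_view_mean"])
    .groupby("claim_status")[["weighted", "n"]]
    .sum()
)
label_mean = sums["weighted"] / sums["n"]
gap_between = label_mean["claim"] - label_mean["opinion"]

print(spread_within.round(3))
print(round(gap_between, 3))
```

The gap between labels is roughly 0.11, while the spread across ban statuses stays below 0.02 within either label, which is the pattern the bullets above describe.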

## Section 2 Summary: Exploratory Analysis

- **Label balance:** The dataset is reasonably balanced between claims and opinions, making it suitable for supervised modeling.  
- **Claim dynamics:** Claim videos generally attract higher view counts and engagement than opinion videos.  
- **Author ban patterns:** Videos from banned or under-review authors show substantially higher engagement than videos from active authors, though this reflects correlation only and causality cannot be inferred from the dataset.  
- **Engagement distributions:** Raw engagement counts are right-skewed and long-tailed, with means above medians in most groups; the gap is largest for active authors.  
- **Shares:** Median share counts are substantially higher for banned authors, suggesting virality dynamics that warrant closer review.  
- **Normalized engagement:** Per-view rates (e.g., likes per view, comments per view, shares per view) show claim status to be more strongly associated with engagement than ban status.  
- **Sample size context:** Group counts show that most comparisons rest on large subsets, though opinion videos from banned authors (196) form a small group.  

**Overall:** Claim content and author ban status both correlate with higher engagement, though claim status shows the stronger association. These findings will inform subsequent feature engineering and model development.  