# **TikTok Project - Statistical Analysis (part 3)**

This notebook constitutes the third and final analytical chapter of the TikTok classification project.  
After completing *Exploratory Data Analysis – part 1* (initial familiarisation) and *Exploratory Data Analysis – part 2* (in‑depth visual inspection), we now turn to formal statistical testing.

**The purpose** of this project is to:

1. Confirm, with inferential statistics, whether engagement metrics differ between groups of interest.
2. Quantify the uncertainty surrounding our estimates.

**The goal** is to summarise the practical implications for the TikTok content‑strategy team.


To start, we import necessary libraries and load the dataset ...

In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import package needed for statistical analysis
from scipy import stats

# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

In [15]:
# Display first few rows
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


Next, we check for missing values and drop any rows that contain them, so that we work with a clean sample. We then compute a first descriptive statistic: the mean `video_view_count` of each of the two groups we want to compare (verified vs. not verified) ...

In [2]:
# Check for missing values
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [5]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [4]:
# Compute the mean `video_view_count` for each group in `verified_status`
print("Mean video view count by verified status:")
data.groupby("verified_status")["video_view_count"].mean()

Mean video view count by verified status:


verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
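
Before testing, it can help to look at more than the mean of each group. The following is a minimal sketch on a made-up toy frame (the values are illustrative and are not taken from `tiktok_dataset.csv`); it reuses the column names from this notebook and shows how `dropna` followed by `groupby(...).agg(...)` gives the group size, mean, median and standard deviation in one table.

```python
import pandas as pd

# Toy data for illustration only; these numbers are invented.
toy = pd.DataFrame({
    "verified_status": ["verified", "not verified", "not verified",
                        "verified", "not verified", "verified"],
    "video_view_count": [1200.0, 54000.0, None, 800.0, 310000.0, 2500.0],
})

# Drop rows with missing values, as done for the real dataset above
toy_clean = toy.dropna(axis=0)

# Size, centre and spread of each group in a single table
summary = (
    toy_clean.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median", "std"])
)
print(summary)
```

Very different group sizes or standard deviations are one reason to prefer a test that does not assume equal variances.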

We can now move on to inferential statistics and perform hypothesis testing.
We wish to know whether *verified* accounts achieve systematically different **video view counts** from *unverified* ones.

* **Null hypothesis ($H_0$)**: the population means are equal.  
* **Alternative hypothesis ($H_1$)**: the population means differ.

With two independent samples of thousands of observations each, the Central Limit Theorem implies that the sample means are approximately normally distributed, which justifies a *Welch two-sample t-test*; unlike Student's t-test, it does not assume equal variances in the two groups. We set the significance threshold at 5 % ($\alpha=0.05$). The test returns a **t-statistic** and a **p-value**: if the p-value is below 0.05, we reject $H_0$ and conclude that the mean view counts differ between verified and unverified accounts. We also compute the 95% confidence interval for the difference in means.
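
To make the test less of a black box, here is a minimal sketch of the Welch statistic computed by hand and compared with `scipy.stats.ttest_ind`. The helper name `welch_t` and the synthetic samples are assumptions made for this illustration; they are not part of the project code or data.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Welch t-statistic, Welch-Satterthwaite df and two-sided p-value
    for the difference mean(a) - mean(b). Hypothetical helper."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error of mean(a)
    vb = b.var(ddof=1) / b.size  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value
    return t, df, p


# Synthetic samples with unequal sizes and variances (illustration only)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=3.0, size=400)
group_b = rng.normal(loc=9.0, scale=1.0, size=150)

t_manual, df_manual, p_manual = welch_t(group_a, group_b)
res_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual: t={t_manual:.4f}, df={df_manual:.1f}, p={p_manual:.3g}")
print(f"scipy : t={res_scipy.statistic:.4f}, p={res_scipy.pvalue:.3g}")
```

The two lines should agree up to floating-point error, since `equal_var=False` selects the same Welch formulas.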

In [14]:
# Save each sample in a variable
not_verified = data.loc[data["verified_status"] == "not verified", "video_view_count"]
verified     = data.loc[data["verified_status"] == "verified",      "video_view_count"]

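# Welch two-sample t-test; the sign of the statistic follows mean(a) − mean(b),
# i.e. not_verified − verified
res = stats.ttest_ind(a=not_verified, b=verified, equal_var=False)

# 95 % confidence interval for the difference in means (verified − not_verified).
# Note: this is the opposite order to the t-statistic above, so the signs differ.
mean_diff = verified.mean() - not_verified.mean()
se        = np.sqrt(verified.var(ddof=1) / verified.size +
                    not_verified.var(ddof=1) / not_verified.size)
alpha  = 0.05
# `res.df` holds the Welch degrees of freedom; it is only exposed by newer SciPy releases
t_crit = stats.t.ppf(1 - alpha / 2, df=res.df)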
ci_low = mean_diff - t_crit * se
ci_high = mean_diff + t_crit * se

# Print results
print("\nResults of Welch two-sample t-test...")
print(f"t‑statistic: {res.statistic:.3f}")
print(f"p‑value: {res.pvalue:.3g}")
print(f"95% CI for mean difference: [{ci_low:.1f}, {ci_high:.1f}]")


Results of Welch two-sample t-test...
t‑statistic: 25.499
p‑value: 2.61e-120
95% CI for mean difference: [-187626.4, -160822.9]
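
The t-statistic above is positive while the confidence interval is negative. This is a matter of ordering: `ttest_ind` works with mean(a) − mean(b), here not verified − verified, whereas the interval was computed for verified − not verified. The sketch below illustrates the convention on synthetic data; the helper `welch_ci` and the generated samples are assumptions for this example only.

```python
import numpy as np
from scipy import stats


def welch_ci(a, b, alpha=0.05):
    """Confidence interval for mean(a) - mean(b) using Welch's
    standard error and degrees of freedom. Hypothetical helper."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    half_width = stats.t.ppf(1 - alpha / 2, df) * np.sqrt(va + vb)
    diff = a.mean() - b.mean()
    return diff - half_width, diff + half_width


# Synthetic samples (illustration only): high_group has the larger mean
rng = np.random.default_rng(42)
low_group = rng.normal(loc=100.0, scale=20.0, size=300)
high_group = rng.normal(loc=150.0, scale=60.0, size=500)

# Same data, two argument orders
res_ab = stats.ttest_ind(high_group, low_group, equal_var=False)
res_ba = stats.ttest_ind(low_group, high_group, equal_var=False)
ci_ab = welch_ci(high_group, low_group)  # interval for high − low
ci_ba = welch_ci(low_group, high_group)  # interval for low − high

print(f"a=high, b=low : t={res_ab.statistic:.2f}, CI=({ci_ab[0]:.1f}, {ci_ab[1]:.1f})")
print(f"a=low,  b=high: t={res_ba.statistic:.2f}, CI=({ci_ba[0]:.1f}, {ci_ba[1]:.1f})")
```

Swapping the arguments flips the sign of the statistic and mirrors the interval around zero, but leaves the p-value unchanged. The direction of the effect should therefore be read from the group means, not from the sign alone.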


### **Key takeaway for stakeholders**

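> **Videos from verified accounts receive significantly fewer views on average than videos from unverified accounts** (p < 0.05): roughly 91,000 versus 266,000 views, with a 95% confidence interval for the difference (verified − not verified) of about −187,600 to −160,800.  
> The test shows that this difference is unlikely to be due to chance, but not why it exists; verification itself may not be the cause, and the gap could reflect differences in the content the two groups post.
> Any modelling pipeline that predicts engagement should therefore include **`verified_status`** as a feature or stratify on it.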

