
# Data exploration and hypothesis testing

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading <br>
**Part 2:** Conduct hypothesis testing <br>
**Part 3:** Communicate insights with stakeholders <br>

### PACE: Plan

Consider the questions in your PACE Strategy Document and those below to craft your response.

1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.

**Part 1:** Imports, links, and loading

In [3]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

# Suppress specific FutureWarnings from Seaborn
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="seaborn")

# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### PACE: Analyze and Construct
**Part 2:** Statistical Tests

1. Data professionals use descriptive statistics for Exploratory Data Analysis. How can computing descriptive statistics help you learn more about your data in this stage of your analysis?
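One way to approach this question is to look at what `describe()` returns. The sketch below uses a small hypothetical frame (`toy`, five made-up rows, not the full dataset) to show that the summary exposes counts, central tendency, and spread at a glance, and that comparing the mean with the median gives a quick hint about skew.

```python
import pandas as pd

# Hypothetical toy data standing in for a few rows of the real dataset
toy = pd.DataFrame({
    "video_duration_sec": [59, 32, 31, 25, 19],
    "video_view_count": [343296.0, 140877.0, 902185.0, 437506.0, 56167.0],
})

# count, mean, std, min, quartiles, and max for each numeric column
summary = toy.describe()
print(summary)

# A mean well above the median hints at right skew or large outliers
mean_views = toy["video_view_count"].mean()
median_views = toy["video_view_count"].median()
print(mean_views > median_views)
```

On the real data, the same call (`data.describe()`) would also show whether the `count` row differs between columns, which is an early sign of missing values.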


Check for and handle missing values.

In [4]:
# Check for missing values
data.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

In [5]:
# Drop rows with missing values
data = data.dropna(axis=0)

In [6]:
# Display first few rows after handling missing values
data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


You are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.



In [9]:
# Compute the mean `video_view_count` for each group in `verified_status`
data.groupby("verified_status")["video_view_count"].mean().round(2)


verified_status
not verified    265663.79
verified         91439.16
Name: video_view_count, dtype: float64
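A group mean alone can be driven by a handful of very popular videos. As a hedged illustration, the sketch below builds a hypothetical frame (`toy`, invented values) and reports the count, mean, and median per group, so the effect of a single outlier on the mean is visible.

```python
import pandas as pd

# Hypothetical values: one very large view count in the "not verified" group
toy = pd.DataFrame({
    "verified_status": ["not verified", "not verified", "not verified",
                        "verified", "verified"],
    "video_view_count": [1000.0, 3000.0, 500000.0, 2000.0, 4000.0],
})

# Count, mean, and median of views for each verification group
group_summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median"])
    .round(2)
)
print(group_summary)
```

In this toy example the two medians are equal while the means differ widely, which is why a difference in sample means is followed up with a formal test rather than taken at face value.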

**Part 2:** Hypothesis Testing
<br>
**Objective**: Conduct a two-sample t-test
1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
<br>
**Null hypothesis and alternative hypothesis**
<br>
* $H_0$: There is no difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified
     accounts (any observed difference in the sample data is due to chance or sampling variability).
<br>
* $H_A$: There is a difference in the mean number of views between TikTok videos posted by verified accounts and TikTok videos posted by unverified
     accounts (any observed difference in the sample data is due to an actual difference in the corresponding population means).
<br>
**Choose a significance level**
<br>
Choose 5% as the significance level.
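The last step of the procedure is a mechanical comparison of the p-value with the chosen significance level. A minimal sketch of that decision rule follows; the helper name `decide` is hypothetical, and the strict `<` comparison is one common convention (some texts use `<=`).

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    # Reject H0 only when the p-value falls below the significance level
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Illustrative p-values, not results from the dataset
print(decide(0.03))
print(decide(0.20))
```

Fixing `alpha` before looking at the p-value matters: choosing the threshold after seeing the result would undermine the test.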



In [15]:
# Conduct a two-sample t-test to compare means
# Save each sample in a variable
not_verified = data[data["verified_status"] == "not verified"]["video_view_count"]
verified = data[data["verified_status"] == "verified"]["video_view_count"]

# Implement a t-test using the two samples
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)


TtestResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)

Since the p-value is extremely small (much smaller than the significance level of 5%), reject the
null hypothesis. It can be concluded that there is a statistically significant difference in the mean video
view count between verified and unverified accounts on TikTok.
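Passing `equal_var=False` to `stats.ttest_ind` requests Welch's t-test, which does not assume equal variances in the two groups; this is also why the reported `df` is not a whole number. As a hedged sketch of what happens under the hood, the hypothetical helper `welch_t` below computes the statistic and the Welch-Satterthwaite degrees of freedom with the standard library only, on made-up samples rather than the TikTok data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each mean: sample variance / sample size
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    # Difference in means scaled by the combined standard error
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = diff / math.sqrt(se2_a + se2_b)
    # Degrees of freedom are estimated, so they are usually fractional
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Made-up samples for illustration only
a = [10.0, 12.0, 14.0, 16.0]
b = [1.0, 2.0, 3.0, 4.0]
t_stat, df = welch_t(a, b)
print(t_stat, df)
```

The p-value is then obtained from the t distribution with `df` degrees of freedom, a step that `scipy` performs internally and that this sketch leaves out.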


### PACE: Execute

**Verified Status Hypothesis Test Results**:

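There is a statistically significant difference in average view counts between verified and unverified accounts (p < 0.001). In this sample, videos from unverified accounts average more views (265,663.79) than videos from verified accounts (91,439.16). This may point to behavioral differences between the two groups (e.g., clickbait or bot activity inflating unverified views), although the test itself does not identify a cause.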

**Next**: Build a logistic regression model on `verified_status` as a baseline for claim prediction (handles skew and binary target).