# Project Goal:
The TikTok data team is developing a machine learning model for classifying claims made in videos submitted to the platform.

## Background:
TikTok is the leading destination for short-form mobile video. The platform is built to help imaginations thrive. TikTok's mission is to create a place for inclusive, joyful, and authentic content—where people can safely discover, create, and connect.

## Scenario:
The TikTok data team has successfully completed exploratory data analysis on the data for the claims classification project. The team is ready to begin the process of hypothesis testing. You’ve been asked to investigate TikTok's user claim dataset to determine which hypothesis testing method best serves the data and the claims classification project.

## Course 4 Tasks:
1. **Import relevant packages and TikTok data**
2. **Explore the project data**
3. **Implement a hypothesis test**
4. **Communicate insights with stakeholders within TikTok**

# Specific Project Deliverables

With this end-of-course project, you will gain valuable practice and apply your new skills as you complete the following:

1. **Course 4 PACE Strategy Document** to consider questions, details, and action items for each stage of the project scenario.
2. **Answer the questions** in the Jupyter notebook project file.
3. **Consider the different groups of data** represented in the dataset.
4. **Implement a hypothesis test.**
5. **Create an executive summary** to share your results.

# Dataset Overview: `tiktok_dataset.csv`

This dataset contains synthetic data created for this project in partnership with TikTok. It contains **19,382 rows**, where each row represents a different published TikTok video in which a claim/opinion has been made. The dataset has **12 columns** described below:

| **Column Name**            | **Type** | **Description** |
|----------------------------|----------|-----------------|
| `#`                        | int      | TikTok assigned number for video with claim/opinion. |
| `claim_status`             | obj      | Whether the published video has been identified as an “opinion” or a “claim.” An “opinion” refers to personal beliefs or thoughts, and a “claim” refers to unsourced or unverified information. |
| `video_id`                 | int      | Random identifying number assigned to a video upon publication on TikTok. |
| `video_duration_sec`       | int      | How long the published video is, measured in seconds. |
| `video_transcription_text` | obj      | Transcribed text of the words spoken in the published video. |
| `verified_status`          | obj      | Indicates the verification status of the user who published the video: “verified” or “not verified.” |
| `author_ban_status`        | obj      | Indicates the permissions status of the user who published the video: “active,” “under review,” or “banned.” |
| `video_view_count`         | float    | The total number of times the published video has been viewed. |
| `video_like_count`         | float    | The total number of times the published video has been liked by other users. |
| `video_share_count`        | float    | The total number of times the published video has been shared by other users. |
| `video_download_count`     | float    | The total number of times the published video has been downloaded by other users. |
| `video_comment_count`      | float    | The total number of comments on the published video. |


# **Course 4 End-of-course project: Data exploration and hypothesis testing**

**The purpose** of this project is to demonstrate knowledge of how to prepare, create, and analyze hypothesis tests.

**The goal** is to apply descriptive and inferential statistics, probability distributions, and hypothesis testing in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading
* What data packages will be necessary for hypothesis testing?

**Part 2:** Conduct hypothesis testing
* How will descriptive statistics help you analyze your data?

* How will you formulate your null hypothesis and alternative hypothesis?

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerge from your hypothesis test?

* What business recommendations do you propose based on your results?

# **Data exploration and hypothesis testing**

#### What is your research question for this data project?

**Question:** Is there a statistically significant difference in the number of views for TikTok videos posted by verified accounts versus unverified accounts?

### **Task 1. Imports and Data Loading**

In [1]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing
from scipy import stats

In [2]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Task 2. Data exploration**

In [3]:
data.head()

|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|--------------|----------|--------------------|--------------------------|-----------------|-------------------|------------------|------------------|-------------------|----------------------|---------------------|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [4]:
data.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|-------|------|-----|-----|-----|-----|-----|-----|
| # | 19382.0 | 9691.5 | 5595.246 | 1.0 | 4846.25 | 9691.5 | 14536.75 | 19382.0 |
| video_id | 19382.0 | 5627454000.0 | 2536440000.0 | 1234959000.0 | 3430417000.0 | 5618664000.0 | 7843960000.0 | 9999873000.0 |
| video_duration_sec | 19382.0 | 32.42173 | 16.22997 | 5.0 | 18.0 | 32.0 | 47.0 | 60.0 |
| video_view_count | 19084.0 | 254708.6 | 322893.3 | 20.0 | 4942.5 | 9954.5 | 504327.0 | 999817.0 |
| video_like_count | 19084.0 | 84304.64 | 133420.5 | 0.0 | 810.75 | 3403.5 | 125020.0 | 657830.0 |
| video_share_count | 19084.0 | 16735.25 | 32036.17 | 0.0 | 115.0 | 717.0 | 18222.0 | 256130.0 |
| video_download_count | 19084.0 | 1049.43 | 2004.3 | 0.0 | 7.0 | 46.0 | 1156.25 | 14994.0 |
| video_comment_count | 19084.0 | 349.3121 | 799.6389 | 0.0 | 1.0 | 9.0 | 292.0 | 9599.0 |


In [6]:
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64
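
Seven columns each report exactly 298 missing values, which suggests, but does not prove, that the same 298 rows are incomplete. The sketch below shows one way to check this before calling `dropna()`. It uses a small made-up frame (`toy`, a hypothetical stand-in) rather than the real dataset, so the numbers it prints say nothing about the TikTok data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the TikTok data (values are made up for illustration)
toy = pd.DataFrame({
    "claim_status": ["claim", None, "opinion", None],
    "verified_status": ["verified", "not verified", "not verified", "verified"],
    "video_view_count": [100.0, np.nan, 250.0, np.nan],
    "video_like_count": [10.0, np.nan, 30.0, np.nan],
})

# Columns that contain at least one missing value
cols_with_missing = toy.columns[toy.isnull().any()]

# Rows missing in ANY of those columns vs. rows missing in ALL of them
rows_any_missing = toy[cols_with_missing].isnull().any(axis=1)
rows_all_missing = toy[cols_with_missing].isnull().all(axis=1)

# If the two masks agree, the missing values sit in the same rows
same_rows = bool((rows_any_missing == rows_all_missing).all())
n_incomplete = int(rows_any_missing.sum())
print(same_rows, n_incomplete)
```

If the two masks agree on the real data, dropping incomplete rows removes exactly one block of records instead of thinning several columns independently.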


In [8]:
# Drop rows with missing values
data = data.dropna()

In [9]:
# Display first few rows after handling missing values
data.head()

|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|--------------|----------|--------------------|--------------------------|-----------------|-------------------|------------------|------------------|-------------------|----------------------|---------------------|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [15]:
# Compute the mean `video_view_count` for each group in `verified_status`
mean_view_count = data.groupby('verified_status')['video_view_count'].mean()
print(mean_view_count)

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64


In [14]:
# Compute the median `video_view_count` for each group in `verified_status`
median_view_count = data.groupby('verified_status')['video_view_count'].median()
print(median_view_count)

verified_status
not verified    46723.0
verified         6023.5
Name: video_view_count, dtype: float64
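
In both groups the mean is several times larger than the median, which points to a right-skewed distribution of view counts. The sketch below shows how count, mean, median, and standard deviation can be put side by side for each group in one call. The frame `toy` and its lognormal values are synthetic assumptions chosen only to mimic a skewed count variable; they are not drawn from the TikTok data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "view counts" for two groups (illustration only)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "verified_status": ["verified"] * 500 + ["not verified"] * 500,
    "video_view_count": np.concatenate([
        rng.lognormal(mean=8.0, sigma=1.5, size=500),
        rng.lognormal(mean=9.0, sigma=1.5, size=500),
    ]),
})

# One row per group with the statistics that matter for choosing a test
summary = (
    toy.groupby("verified_status")["video_view_count"]
    .agg(["count", "mean", "median", "std"])
)

# A mean-to-median ratio well above 1 is a quick signal of right skew
summary["mean_to_median"] = summary["mean"] / summary["median"]
print(summary)
```

The `count` and `std` columns are also worth a look before testing, since very unequal group sizes and spreads bear on whether pooling the variances is reasonable.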


### **Task 3. Hypothesis testing**

Before conducting the hypothesis test, answer the following questions where applicable to complete your code response:

1. Recall the difference between the null hypothesis and the alternative hypothesis.
2. What are your hypotheses for this data project?

### 1. Difference between the null hypothesis and the alternative hypothesis:

- **Null Hypothesis (H₀):** The null hypothesis is a statement that there is **no effect or no difference** between groups or variables. It assumes that any observed differences are due to chance or random variation. In the context of hypothesis testing, we aim to either reject or fail to reject the null hypothesis based on the data.

- **Alternative Hypothesis (H₁):** The alternative hypothesis is a statement that there **is an effect or a difference** between groups or variables. It is the claim the test looks for evidence of: if the data yield a statistically significant result, we reject the null hypothesis in favor of the alternative.

### 2. Hypotheses for this data project:

- **Null Hypothesis (H₀):** There is no difference in the mean `video_view_count` between TikTok videos posted by verified accounts and those posted by unverified accounts; any difference observed in the sample is due to chance.

  \[
  H_0: \mu_{\text{verified}} = \mu_{\text{unverified}}
  \]

- **Alternative Hypothesis (H₁):** There is a difference in the mean `video_view_count` between TikTok videos posted by verified accounts and those posted by unverified accounts; the difference observed in the sample reflects a real difference in the population.

  \[
  H_1: \mu_{\text{verified}} \neq \mu_{\text{unverified}}
  \]


The goal in this step is to conduct a two-sample t-test. Recall that the steps for conducting a hypothesis test are:

1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis
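
To make step 3 less of a black box, the sketch below computes the pooled (equal-variance) two-sample t statistic and its two-sided p-value by hand and compares them with `scipy.stats.ttest_ind(..., equal_var=True)`. The samples `group_a` and `group_b` are synthetic and hypothetical, chosen only so the arithmetic can be followed; they are not TikTok data.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only (not TikTok data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=40)
group_b = rng.normal(loc=11.0, scale=2.0, size=60)

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom for the pooled test

# Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / df

# t statistic: difference in means divided by its standard error
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df=df)

# The library call should agree with the hand computation
t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=True)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

The sign of the statistic follows the order of the arguments: a negative value means the first sample has the smaller mean.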

In [12]:
# Conduct a two-sample t-test to compare means

# Separate the data into two groups: verified and unverified
verified = data[data['verified_status'] == 'verified']['video_view_count']
unverified = data[data['verified_status'] == 'not verified']['video_view_count']

# Perform an independent t-test (assuming equal variances)
t_stat, p_value = stats.ttest_ind(verified, unverified, equal_var=True)

# Display the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

# Check the significance at the 5% level
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There is a statistically significant difference in video view counts between verified and unverified accounts.")
else:
    print("Fail to reject the null hypothesis: There is no statistically significant difference in video view counts between verified and unverified accounts.")

T-statistic: -18.250939509545823
P-value: 8.632160883925904e-74
Reject the null hypothesis: There is a statistically significant difference in video view counts between verified and unverified accounts.
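
The test above pools the two variances (`equal_var=True`). When the groups differ noticeably in size or spread, a common robustness check is to repeat the comparison with Welch's t-test, which does not assume equal variances, and with a rank-based test that is less sensitive to skew. The sketch below is one possible way to do that, not the method used in this notebook. The arrays `sample_verified` and `sample_unverified` are synthetic lognormal draws (an assumption), so the printed values say nothing about the TikTok results.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed samples with different sizes and spreads (illustration only)
rng = np.random.default_rng(42)
sample_verified = rng.lognormal(mean=9.0, sigma=1.0, size=300)
sample_unverified = rng.lognormal(mean=10.0, sigma=1.5, size=3000)

# Levene's test: a small p-value casts doubt on the equal-variance assumption
levene_stat, levene_p = stats.levene(sample_verified, sample_unverified)

# Welch's t-test: same function as before, but without pooling the variances
welch_t, welch_p = stats.ttest_ind(sample_verified, sample_unverified, equal_var=False)

# Mann-Whitney U: compares the groups using ranks instead of raw values
mw_stat, mw_p = stats.mannwhitneyu(sample_verified, sample_unverified, alternative="two-sided")

print(f"Levene p-value:       {levene_p:.3g}")
print(f"Welch t-statistic:    {welch_t:.3f}, p-value: {welch_p:.3g}")
print(f"Mann-Whitney p-value: {mw_p:.3g}")
```

If all three approaches point the same way on the real data, the conclusion does not hinge on the equal-variance assumption.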


### Business Insights from the Hypothesis Test:

The hypothesis test indicates a **statistically significant difference** in average video view counts between the two groups (t ≈ -18.25, p ≈ 8.63e-74, far below the 0.05 significance level), so we reject the null hypothesis. The group means show the direction of the difference, with **non-verified accounts receiving more views than verified accounts**: about 265,664 views per video on average versus about 91,439. This is a surprising result, as one might expect verified accounts to receive more views due to their credibility. The test establishes that the means differ; it does not explain why.
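
With samples this large, even a modest difference can produce a very small p-value, so it can help to report an effect size alongside it. The sketch below computes Cohen's d, the difference in means divided by the pooled standard deviation. The helper `cohens_d` is a hypothetical name introduced here, and the input values are tiny made-up samples rather than the view counts.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: sample variances (ddof=1) weighted by degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Tiny made-up samples, for illustration only
d = cohens_d([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])
print(round(d, 3))  # -1.265
```

By a common rule of thumb, magnitudes near 0.2, 0.5, and 0.8 are read as small, medium, and large effects, although for heavily skewed data such as view counts the value should be interpreted with care.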

#### Key Insights:

1. **Potential for Growth Among Non-Verified Accounts:** Non-verified accounts seem to be driving more views. This could suggest that content from everyday users, creators, or less-established accounts may resonate more with TikTok's broader audience. TikTok could explore ways to further empower non-verified creators through tools, visibility, or incentives for content creation.

2. **Audience Preference for Non-Verified Content:** The platform's users might prefer the content from non-verified accounts, possibly because it feels more authentic, grassroots, or relatable. TikTok could capitalize on this by highlighting the diversity and authenticity of non-verified creators.

3. **Re-evaluating the Role of Verified Accounts:** The lower view count for verified accounts could indicate that the verified badge doesn't necessarily translate to higher engagement or that these accounts may focus more on niche audiences. TikTok may need to reassess how it promotes verified users and ensure their content reaches the intended audience more effectively.

4. **Content and Algorithm Strategy:** These insights could prompt TikTok to adjust its recommendation algorithms, ensuring a balance between promoting content from verified and non-verified accounts to maintain platform diversity while engaging a broader audience.
