# **TikTok Project**
**Course 4 - The Power of Statistics**

You are a data professional at TikTok. The current project is reaching its midpoint; a project proposal, Python coding work, and exploratory data analysis have all been completed.

The team has reviewed the results of the exploratory data analysis and the previous executive summary the team prepared. You received an email from Orion Rainier, Data Scientist at TikTok, with your next assignment: determine and conduct the necessary hypothesis tests and statistical analysis for the TikTok classification project.

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Data exploration and hypothesis testing**

### **Task 1. Imports and Data Loading**

Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

In [6]:
# Import packages for data manipulation

import pandas as pd
import numpy as np

# Import packages for data visualization

import matplotlib.pyplot as plt
import seaborn as sns

# Import packages for statistical analysis/hypothesis testing

from scipy import stats

In [7]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

## **Analyze and Construct**


### **Task 2. Data exploration**

Use descriptive statistics to conduct Exploratory Data Analysis (EDA).



Inspect the first five rows of the dataframe.

In [8]:
# Display first few rows
data.head()


| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


In [11]:
# Generate a table of descriptive statistics about the data
data.describe(include="all")


| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19084 | 19382.0 | 19382.0 | 19084 | 19382 | 19382 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| unique | | 2 | | | 19012 | 2 | 3 | | | | | |
| top | | claim | | | a friend read in the media a claim that badmi... | not verified | active | | | | | |
| freq | | 9608 | | | 2 | 18142 | 15663 | | | | | |
| mean | 9691.5 | | 5627454000.0 | 32.421732 | | | | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | | 2536440000.0 | 16.229967 | | | | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | | 1234959000.0 | 5.0 | | | | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | | 3430417000.0 | 18.0 | | | | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | | 5618664000.0 | 32.0 | | | | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | | 7843960000.0 | 47.0 | | | | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |


Check for and handle missing values.

In [17]:
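# Check for missing values
# Count missing values per column (not displayed unless printed or placed last in the cell)
missing_per_column = data.isna().sum()

# Proportion of cells retained when dropping incomplete rows vs. incomplete columns
size = data.size
row_drop = data.dropna(axis=0)
col_drop = data.dropna(axis=1)
row_proportion = row_drop.size / size
col_proportion = col_drop.size / size

print(row_proportion, col_proportion)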

0.9846249097100402 0.4166666666666667


Dropping rows with missing data retains about 98.5% of the values in the dataset, whereas dropping columns with missing data retains only about 41.7%. Removing the affected rows is therefore the appropriate way to handle the missing values.
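
The row-versus-column comparison above can be tried out on a small frame first. The following is a minimal sketch, assuming a hypothetical toy dataframe `toy` with made-up values (it is not the TikTok data); it also prints the per-column missing counts so you can see which columns are affected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (all values are made up)
toy = pd.DataFrame({
    "video_id": [1, 2, 3, 4, 5],
    "claim_status": ["claim", "opinion", None, "claim", "opinion"],
    "video_view_count": [100.0, 250.0, np.nan, 75.0, 300.0],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Share of cells kept when dropping incomplete rows vs. incomplete columns
row_kept = toy.dropna(axis=0).size / toy.size
col_kept = toy.dropna(axis=1).size / toy.size

print(missing_per_column)
print(f"drop rows: keep {row_kept:.1%} | drop columns: keep {col_kept:.1%}")
```

In this toy case one incomplete row touches two columns, so dropping rows keeps far more of the data than dropping columns, which mirrors the reasoning used for the real dataset.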

In [18]:
# Drop rows with missing values using axis=0 as it will retain a greater proportion of the total dataset

data = data.dropna(axis=0)


In [19]:
# Display first few rows after handling missing values

data.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


You are interested in the relationship between `verified_status` and `video_view_count`. One approach is to examine the mean value of `video_view_count` for each group of `verified_status` in the sample data.

In [22]:
# Compute the mean `video_view_count` for each group in `verified_status`
video_view_count_mean = data.groupby('verified_status')['video_view_count'].mean()
print(video_view_count_mean)



verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64
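
The descriptive statistics earlier show a mean `video_view_count` (about 254,709) far above the median (9,954.5), so a handful of very popular videos can pull a group mean upward. It may help to look at group sizes and medians alongside the means. The following is a minimal sketch, assuming a hypothetical toy dataframe `toy` with made-up values rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy data (made up); the column names mirror the real dataset
toy = pd.DataFrame({
    "verified_status": ["verified", "verified",
                        "not verified", "not verified", "not verified"],
    "video_view_count": [100.0, 300.0, 50.0, 150.0, 1000.0],
})

# Group size, mean, and median of view counts for each verification status
summary = (
    toy.groupby("verified_status")["video_view_count"]
       .agg(["count", "mean", "median"])
)
print(summary)
```

The same `.agg([...])` call could be applied to `data` in place of `.mean()` if you want the extra columns for the real groups.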


### **Task 3. Hypothesis testing**




Your goal in this step is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:


1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
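
The four steps above can be sketched end to end in a few lines. This is an illustration only, assuming two synthetic samples (`group_a` and `group_b`, drawn from normal distributions with made-up parameters) instead of the TikTok data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with made-up parameters (not the TikTok data)
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=100.0, scale=15.0, size=200)
group_b = rng.normal(loc=120.0, scale=25.0, size=150)

# Step 1: H0: the two population means are equal; HA: they differ
# Step 2: choose the significance level
alpha = 0.05

# Step 3: find the p-value; equal_var=False does not assume equal variances
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Step 4: reject H0 only if the p-value is below the significance level
reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3g}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Fixing `alpha` before computing the p-value keeps the decision rule independent of the result.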



Null Hypothesis **$H_0$**: There is no difference in the mean number of views (`video_view_count`) between TikTok videos posted by verified accounts and those posted by unverified accounts (grouped by `verified_status`).

Alternative Hypothesis **$H_A$**: There is a difference in the mean number of views (`video_view_count`) between TikTok videos posted by verified accounts and those posted by unverified accounts (grouped by `verified_status`).

You choose 5% as the significance level and proceed with a two-sample t-test.

In [25]:
# Conduct a two-sample t-test to compare means
verified = data[data['verified_status'] == 'verified']['video_view_count']
not_verified = data[data['verified_status'] == 'not verified']['video_view_count']
stats.ttest_ind(a=not_verified, b=verified, equal_var=False)


TtestResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)

The p-value of approximately 2.609e-120 is far below the 5% significance level, so we reject the null hypothesis. We conclude that there is a statistically significant difference in mean video view counts between videos posted by verified accounts and videos posted by unverified accounts on TikTok.
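
Passing `equal_var=False` makes `stats.ttest_ind` run Welch's t-test, which does not assume the two groups share a variance. That is also why the reported degrees of freedom (`df=1571.16...`) are not a whole number. The following is a minimal sketch of the same calculation from first principles, assuming a hypothetical helper `welch_t_test` and synthetic skewed samples (not the TikTok data), checked against SciPy.

```python
import numpy as np
from scipy import stats


def welch_t_test(x, y):
    """Two-sided Welch's t-test (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared standard error of each sample mean: sample variance / n
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    # Test statistic: difference in means over the combined standard error
    t_stat = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    # Two-sided p-value from the t distribution's survival function
    p_value = 2.0 * stats.t.sf(np.abs(t_stat), df)
    return t_stat, df, p_value


# Synthetic, right-skewed samples with unequal sizes (made up)
rng = np.random.default_rng(seed=1)
sample_x = rng.exponential(scale=3.0, size=400)
sample_y = rng.exponential(scale=2.0, size=120)

t_stat, df, p_value = welch_t_test(sample_x, sample_y)
reference = stats.ttest_ind(a=sample_x, b=sample_y, equal_var=False)

print(f"manual: t = {t_stat:.4f}, df = {df:.2f}, p = {p_value:.3g}")
print(f"scipy:  t = {reference.statistic:.4f}, p = {reference.pvalue:.3g}")
```

The manual statistic and p-value should agree with SciPy's up to floating-point error; the helper exists only to make the formula visible and is not a replacement for `stats.ttest_ind`.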

### **Task 4. Communicate insights with stakeholders**

The analysis shows a statistically significant difference in average view counts between videos posted by verified accounts and videos posted by unverified accounts. In the sample, videos from unverified accounts average about 265,664 views, compared with about 91,439 for videos from verified accounts. This suggests that the two groups may differ in behavior.

It would be interesting and potentially insightful to investigate these differences further. Does one group post more 'clickbait'-type videos than the other? Do the groups differ in how often they post 'claims' versus 'opinions'?

The next step is to build a regression model on `verified_status`, since the end goal is to make predictions on claim status.