# TikTok Video Classification Project: Hypothesis Testing
*Purpose*\
TikTok is a platform for producing and viewing short-form mobile videos. Users of the platform can report videos and comments that contain claims rather than opinions. With the high number of submissions and interactions on TikTok each day, it is challenging for human moderators to efficiently review every reported video and comment. TikTok wants to reduce the backlog of user reports and prioritize claim reports. **The goal of this project is to mitigate misinformation in videos on the TikTok platform by building a reliable machine learning model which will help reduce the report backlog**.

* An *opinion* is a personal or group belief or thought concerning any information, action, thought, person, group, place, or thing
* A *claim* is unqualified information concerning any information, action, thought, person, group, place, or thing

As presented by TikTok: “any answers, responses, comments, opinions, analysis or recommendations that you are not properly licensed or otherwise qualified to provide (https://www.tiktok.com/legal/page/us/terms-of-service/en ).” \
TikTok safety: https://newsroom.tiktok.com/en-us/safety

*Deliverables*
> **Appendix 1: Hypothesis Testing, Two-sample t-tests**\
Several two-sample hypothesis tests (t-tests) to determine whether the differences in *mean* `video_view_count` between the groups of `verified_status`, `claim_status`, and `author_ban_status` are statistically significant or plausibly the result of random sampling.

*Data*\
The data set used here comes from the Google Advanced Data Analytics Professional Certificate course on the Coursera platform: https://www.coursera.org/google-certificates/advanced-data-analytics-certificate

*Code*\
All code for this project is located at: https://github.com/izsolnay/TikTok_Python

# Appendix 1: Hypothesis Testing, Two-sample t-tests

In [1]:
# Import standard operational packages
import pandas as pd
import numpy as np

# Import additional statistical package
from scipy import stats

# Set Jupyter to display all of the columns (no redaction)
pd.set_option('display.max_columns', None)

In [2]:
# Import data; create df
df0 = pd.read_csv('TikTok_clean.csv')
df0.head()

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count | likes_per_view | comments_per_view | shares_per_view |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 | 0.056584 | 0.0 | 0.000702 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 | 0.549096 | 0.004855 | 0.135111 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 | 0.108282 | 0.000365 | 0.003168 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 | 0.548459 | 0.001335 | 0.079569 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 | 0.62291 | 0.002706 | 0.073175 |


In [3]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19084 entries, 0 to 19083
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19084 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19084 non-null  int64  
 3   video_duration_sec        19084 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19084 non-null  object 
 6   author_ban_status         19084 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
 12  likes_per_view            19084 non-null  float64
 13  comments_per_view         19084 non-null  float64
 14  shares_per_view           19084 non-null  float64

In [4]:
# Retype `#` and `video_id` as obj
df0[['#', 'video_id']] = df0[['#', 'video_id']].astype('object')
df = df0.copy()
df.dtypes

#                            object
claim_status                 object
video_id                     object
video_duration_sec            int64
video_transcription_text     object
verified_status              object
author_ban_status            object
video_view_count            float64
video_like_count            float64
video_share_count           float64
video_download_count        float64
video_comment_count         float64
likes_per_view              float64
comments_per_view           float64
shares_per_view             float64
dtype: object

In [5]:
df.describe().round(2)

| | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count | likes_per_view | comments_per_view | shares_per_view |
|---|---|---|---|---|---|---|---|---|---|
| count | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 32.42 | 254708.56 | 84304.64 | 16735.25 | 1049.43 | 349.31 | 0.28 | 0.0 | 0.05 |
| std | 16.23 | 322893.28 | 133420.55 | 32036.17 | 2004.3 | 799.64 | 0.17 | 0.0 | 0.05 |
| min | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 | 0.13 | 0.0 | 0.01 |
| 50% | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 | 0.26 | 0.0 | 0.04 |
| 75% | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 | 0.4 | 0.0 | 0.08 |
| max | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 | 0.67 | 0.01 | 0.27 |


## Hypothesis testing: Welch's t-test
Welch's t-test does not assume equal population variances (there is no reason to assume equal variances here)\
(Variance: the average of the squared differences of each data point from the mean)
1. State the NULL hypothesis ($H_0$) and the alternative hypothesis ($H_a$)
    * $H_0$: there is no difference between the population means; any observed difference is due to CHANCE
    * $H_a$: there is a difference between the population means; the observed difference is not due to chance
2. Choose a significance level: 5%
3. Find the p-value, using the `stats.ttest_ind()` function to perform the test
4. Reject $H_0$ if the p-value is below the significance level; otherwise fail to reject it
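
As an illustration of the four steps, the following is a minimal sketch that computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand and checks them against `stats.ttest_ind(..., equal_var=False)`. The samples `group_a` and `group_b` are synthetic and hypothetical; they are not drawn from the TikTok data.

```python
import numpy as np
from scipy import stats

# Synthetic samples with unequal variances (hypothetical, not the TikTok data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=200)
group_b = rng.normal(loc=9.0, scale=5.0, size=80)

# Variance of each sample mean: sample variance divided by sample size
var_a = group_a.var(ddof=1) / group_a.size
var_b = group_b.var(ddof=1) / group_b.size

# Welch's t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(var_a + var_b)
df_manual = (var_a + var_b) ** 2 / (
    var_a ** 2 / (group_a.size - 1) + var_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# The same test performed by SciPy
result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)

# Decision at the 5% significance level
significance_level = 0.05
reject_null = result.pvalue < significance_level

print(t_manual, result.statistic)
print(p_manual, result.pvalue)
print(reject_null)
```

The hand-computed statistic and p-value should agree with SciPy's result up to floating-point error, which shows that `equal_var=False` selects Welch's version of the test.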

In [6]:
# Set significance level
significance_level = 0.05
significance_level

0.05

### Two-sample t-test for statistical significance between verification statuses and mean video views 
* $H_0$: there is no difference in the *mean* number of video views between verified and unverified authors; any observed difference is due to CHANCE
* $H_a$: there is a difference in the *mean* number of video views between verified and unverified authors; the observed difference is not due to chance

In [7]:
# Calculate count and % of verified_status in full dataset
print(df.shape)
print(df['verified_status'].value_counts()) # value_counts counts how many times each value appears
print(df['verified_status'].value_counts(normalize = True).mul(100).round(1).astype(str) + '%')  # normalize = True returns proportions, shown here as percentages

(19084, 15)
verified_status
not verified    17884
verified         1200
Name: count, dtype: int64
verified_status
not verified    93.7%
verified         6.3%
Name: proportion, dtype: object


In [8]:
# Calculate median `video_view_count` for each group in `verified_status`
df.groupby('verified_status')['video_view_count'].median()

verified_status
not verified    46723.0
verified         6023.5
Name: video_view_count, dtype: float64

In [9]:
# Calculate mean `video_view_count` for each group in `verified_status`
df.groupby('verified_status')['video_view_count'].mean()

verified_status
not verified    265663.785339
verified         91439.164167
Name: video_view_count, dtype: float64

In [10]:
# Conduct a two-sample t-test to compare means
# Isolate not_verified and verified
not_verified = df[df['verified_status'] == 'not verified']['video_view_count']
verified = df[df['verified_status'] == 'verified']['video_view_count']

# Perform t-test
stats.ttest_ind(a=not_verified, b=verified, equal_var=False) # equal_var=False to not assume population variances are =

TtestResult(statistic=25.499441780633777, pvalue=2.6088823687177823e-120, df=1571.163074387424)

#### Results
There appears to be a large difference in both the median and the mean view counts between verified and not verified authors.

Indeed, the p-value is extremely small (2.61e-120 < 0.05), so the NULL hypothesis can be rejected. \
There is a statistically significant difference between the mean video views of verified and unverified authors.
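
The isolate-test-decide pattern used for this comparison can be wrapped in a small helper. This is a minimal sketch: the function name `welch_test_by_group`, its parameters, and the `toy` frame are hypothetical and not part of the original notebook, and the view counts in `toy` are made up for illustration.

```python
import pandas as pd
from scipy import stats

def welch_test_by_group(data, group_col, group_a, group_b, value_col, alpha=0.05):
    """Run Welch's t-test on value_col for two levels of group_col."""
    # Isolate the two samples being compared
    sample_a = data.loc[data[group_col] == group_a, value_col]
    sample_b = data.loc[data[group_col] == group_b, value_col]
    # equal_var=False: do not assume equal population variances
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    return {
        "statistic": result.statistic,
        "pvalue": result.pvalue,
        "reject_null": bool(result.pvalue < alpha),
    }

# Toy frame using the notebook's column names (values are made up)
toy = pd.DataFrame({
    "verified_status": ["not verified"] * 6 + ["verified"] * 6,
    "video_view_count": [900.0, 1100.0, 1000.0, 950.0, 1050.0, 1000.0,
                         100.0, 120.0, 80.0, 110.0, 90.0, 100.0],
})

outcome = welch_test_by_group(
    toy, "verified_status", "not verified", "verified", "video_view_count"
)
print(outcome)
```

With the notebook's data, the same call on `df` should reproduce the test in cell 10, assuming the column names are unchanged.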

### Two-sample t-test for statistical significance between claim statuses and mean video views
* $H_0$: there is no difference in the *mean* number of video views between opinion and claim videos; any observed difference is due to CHANCE
* $H_a$: there is a difference in the *mean* number of video views between opinion and claim videos; the observed difference is not due to chance

In [11]:
# Calculate count and % of claim_status in full dataset
print(df['claim_status'].value_counts()) 
print(df['claim_status'].value_counts(normalize = True).mul(100).round(1).astype(str) + '%') 

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64
claim_status
claim      50.3%
opinion    49.7%
Name: proportion, dtype: object


In [12]:
# Calculate median `video_view_count` for each group in `claim_status`
df.groupby('claim_status')['video_view_count'].median()

claim_status
claim      501555.0
opinion      4953.0
Name: video_view_count, dtype: float64

In [13]:
# Calculate mean `video_view_count` for each group in `claim_status`
df.groupby('claim_status')['video_view_count'].mean()

claim_status
claim      501029.452748
opinion      4956.432250
Name: video_view_count, dtype: float64

In [14]:
# Conduct a two-sample t-test to compare means
# Isolate claim and opinion
claim = df[df['claim_status'] == 'claim']['video_view_count']
opinion = df[df['claim_status'] == 'opinion']['video_view_count']

# Perform t-test
stats.ttest_ind(a=claim, b=opinion, equal_var=False) # equal_var=False to not assume population variances are =

TtestResult(statistic=166.88857822856752, pvalue=0.0, df=9608.91144749953)

#### Results
There appears to be a large difference in both the median and the mean view counts between claim and opinion videos.

Indeed, the p-value is reported as 0.0, meaning it is too small to be represented as a floating-point number (and so far below 0.05), so the NULL hypothesis can be rejected. \
There is a statistically significant difference between the mean video views of claim and opinion videos.

### Two-sample t-tests for statistical significance between author ban statuses and mean video views
$H_0$: there is no difference in the *mean* number of video views between the two author ban statuses being compared; any observed difference is due to CHANCE\
$H_a$: there is a difference in the *mean* number of video views between the two author ban statuses being compared; the observed difference is not due to chance
  
1) banned vs active
2) banned vs under review
3) active vs under review

In [15]:
# Calculate count and % of author_ban_status in full dataset
print(df['author_ban_status'].value_counts()) 
print(df['author_ban_status'].value_counts(normalize = True).mul(100).round(1).astype(str) + '%') 

author_ban_status
active          15383
under review     2066
banned           1635
Name: count, dtype: int64
author_ban_status
active          80.6%
under review    10.8%
banned           8.6%
Name: proportion, dtype: object


In [16]:
# Calculate median `video_view_count` for each group in `author_ban_status`
df.groupby('author_ban_status')['video_view_count'].median()

author_ban_status
active            8616.0
banned          448201.0
under review    365245.5
Name: video_view_count, dtype: float64

In [17]:
# Calculate mean `video_view_count` for each group in `author_ban_status`
df.groupby('author_ban_status')['video_view_count'].mean()

author_ban_status
active          215927.039524
banned          445845.439144
under review    392204.836399
Name: video_view_count, dtype: float64

In [18]:
# Conduct a two-sample t-test to compare means
# Isolate banned and active
banned = df[df['author_ban_status'] == 'banned']['video_view_count']
active = df[df['author_ban_status'] == 'active']['video_view_count']

# Perform t-test
stats.ttest_ind(a=banned, b=active, equal_var=False)

TtestResult(statistic=28.105495839234667, pvalue=1.2902882827873965e-146, df=1986.4599754382912)

In [19]:
# Conduct a two-sample t-test to compare means
# Isolate banned and under_review
banned = df[df['author_ban_status'] == 'banned']['video_view_count']
under_review = df[df['author_ban_status'] == 'under review']['video_view_count']

# Perform t-test
stats.ttest_ind(a=banned, b=under_review, equal_var=False)

TtestResult(statistic=5.0417127756548235, pvalue=4.843209118056266e-07, df=3570.660053763425)

In [20]:
# Conduct a two-sample t-test to compare means
# Isolate active and under_review
active = df[df['author_ban_status'] == 'active']['video_view_count']
under_review = df[df['author_ban_status'] == 'under review']['video_view_count']

# Perform t-test
stats.ttest_ind(a=active, b=under_review, equal_var=False) 

TtestResult(statistic=-22.9890878218809, pvalue=1.4909087700456904e-106, df=2581.6019458184783)

#### Results
The p-values are extremely small for all three tests, so all three NULL hypotheses can be rejected.
1) There is a statistically significant difference between the mean video views of banned and active authors.
2) There is a statistically significant difference between the mean video views of banned and under review authors.
3) There is a statistically significant difference between the mean video views of active and under review authors.
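
The three pairwise comparisons can also be written as a single loop over every pair of group levels. This is a minimal sketch: the helper `pairwise_welch` is a hypothetical name, not part of the original notebook, and the `toy` frame holds synthetic view counts rather than the TikTok data.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

def pairwise_welch(data, group_col, value_col):
    """Run Welch's t-test for every pair of levels in group_col."""
    rows = []
    levels = sorted(data[group_col].unique())
    for level_a, level_b in combinations(levels, 2):
        sample_a = data.loc[data[group_col] == level_a, value_col]
        sample_b = data.loc[data[group_col] == level_b, value_col]
        # equal_var=False: do not assume equal population variances
        result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
        rows.append({
            "group_a": level_a,
            "group_b": level_b,
            "statistic": result.statistic,
            "pvalue": result.pvalue,
        })
    return pd.DataFrame(rows)

# Synthetic data using the notebook's column names (values are made up)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "author_ban_status": np.repeat(["active", "banned", "under review"], 50),
    "video_view_count": np.concatenate([
        rng.normal(200, 50, 50),
        rng.normal(450, 50, 50),
        rng.normal(400, 50, 50),
    ]),
})

pairs = pairwise_welch(toy, "author_ban_status", "video_view_count")
print(pairs)
```

Applied to `df`, the same call should return one row per comparison. Because each pair is taken in sorted order, the sign of a statistic may be flipped relative to the individual cells above, while the p-value is unaffected.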

### Results
There is a statistically significant difference in mean video views between videos:
* by verified versus unverified authors
* labelled claim versus opinion
* by banned versus active authors
* by banned versus under review authors
* by active versus under review authors

The social reasons for these differences are of course a curious question, but what is relevant here is simply that the differences are unlikely to be due to chance. Further work would have to be performed to discover why.
* Are banned and under review authors more exciting because they are banned or under review?
* Do banned and under review authors post more salacious content?
* Do claims seem more convincing than opinions?
* And how does verification play in? Does it help, or does it make content less convincing?