# 1. Introduction

TikTok's data team wants to develop a machine learning model that classifies videos submitted to the platform according to whether they present a claim or an opinion.

# 2. Preliminary Data Inspection 

## 2.1 Import Packages and Data

In [2]:
# Importing the relevant libraries
import numpy as np
import pandas as pd

In [3]:
# Load the dataset
data = pd.read_csv('tiktok_dataset.csv')

## 2.2 Understanding the Data

In [4]:
# Get dimensions of the dataset
data.shape

(19382, 12)

In [5]:
# Display the first 10 rows of the dataset
data.head(n=10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [6]:
# Get data types of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [7]:
# Get summary statistics of the dataset
data.describe(include='all')

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19084 | 19382.0 | 19382.0 | 19084 | 19382 | 19382 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| unique | | 2 | | | 19012 | 2 | 3 | | | | | |
| top | | claim | | | a colleague read in the media that butterflie... | not verified | active | | | | | |
| freq | | 9608 | | | 2 | 18142 | 15663 | | | | | |
| mean | 9691.5 | | 5627454000.0 | 32.421732 | | | | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | | 2536440000.0 | 16.229967 | | | | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | | 1234959000.0 | 5.0 | | | | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | | 3430417000.0 | 18.0 | | | | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | | 5618664000.0 | 32.0 | | | | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | | 7843960000.0 | 47.0 | | | | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |


## 2.3 Descriptive Statistics

Since we are interested in classifying the videos as either claims or opinions, we start by examining the `claim_status` variable.

In [8]:
# Investigate the class balance of the dataset
data['claim_status'].value_counts()

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

The two classes are nearly balanced: 9,608 claims against 9,476 opinions, or about 50.3% and 49.7% of the labelled videos.
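
The class balance can also be read as proportions rather than raw counts. The following is a minimal sketch, assuming a DataFrame with a `claim_status` column; the helper name `class_balance` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, column: str = "claim_status") -> pd.DataFrame:
    """Return the count and share of each class, ignoring missing labels."""
    counts = df[column].value_counts()                 # NaN labels are dropped by default
    shares = df[column].value_counts(normalize=True)   # shares among labelled rows only
    return pd.DataFrame({"count": counts, "share": shares})


# Toy frame for illustration only (made-up values, not the TikTok data)
toy = pd.DataFrame({"claim_status": ["claim", "claim", "opinion", None]})
balance = class_balance(toy)
print(balance)
```

On the real dataset, the call would presumably be `class_balance(data)`.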

In [9]:
# Mean and median view count of the videos labelled 'claim'
cmask = data['claim_status'] == 'claim'
claim_data = data[cmask]
claim_data['video_view_count'].agg(['mean', 'median'])

mean      501029.452748
median    501555.000000
Name: video_view_count, dtype: float64

In [10]:
# Mean and median view count of the videos labelled 'opinion'
omask = data['claim_status'] == 'opinion'
opinion_data = data[omask]
opinion_data['video_view_count'].agg(['mean', 'median'])

mean      4956.43225
median    4953.00000
Name: video_view_count, dtype: float64

Claim videos have much higher view counts than opinion videos: both the mean and the median are around 501,000 views for claims, against roughly 4,950 for opinions, a difference of about a factor of 100.
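
The same comparison can be made in a single call by grouping on `claim_status` instead of building one mask per class. This is a minimal sketch under that assumption; the helper name `views_by_status` and the toy values are illustrative only.

```python
import pandas as pd


def views_by_status(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, and median of video_view_count for each claim_status."""
    return df.groupby("claim_status")["video_view_count"].agg(["count", "mean", "median"])


# Toy frame for illustration only (made-up values, not the TikTok data)
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "video_view_count": [400.0, 600.0, 4.0, 6.0],
})
summary = views_by_status(toy)

# Ratio of the median view counts of the two classes
ratio = summary.loc["claim", "median"] / summary.loc["opinion", "median"]
print(summary)
print(ratio)
```

Rows with a missing `claim_status` are excluded automatically, because `groupby` drops missing keys by default.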

In [11]:
# Get counts for each group combination of claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).count()[['#']]

| claim_status | author_ban_status | # |
|---|---|---|
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |


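Videos by banned authors are far more common among claim videos than among opinion videos: 1,439 of 9,608 claim videos (about 15%) come from banned authors, compared with 196 of 9,476 opinion videos (about 2%). Note that these are counts of videos, not of distinct authors. One possible explanation is that claim videos are reported and penalised more often than opinion videos.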
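
Because the two classes differ slightly in size, comparing shares rather than raw counts can make the difference easier to read. The following is a minimal sketch using `pd.crosstab` with row normalization; the helper name `ban_status_share` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd


def ban_status_share(df: pd.DataFrame) -> pd.DataFrame:
    """Share of each author_ban_status within each claim_status (rows sum to 1)."""
    return pd.crosstab(df["claim_status"], df["author_ban_status"], normalize="index")


# Toy frame for illustration only (made-up values, not the TikTok data)
toy = pd.DataFrame({
    "claim_status": ["claim"] * 4 + ["opinion"] * 4,
    "author_ban_status": [
        "active", "active", "banned", "under review",
        "active", "active", "active", "under review",
    ],
})
shares = ban_status_share(toy)
print(shares)
```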

We will now investigate the `author_ban_status` variable.

In [12]:
# Mean and median views, likes, and shares for each author ban status
data.groupby(['author_ban_status']).agg({
    'video_view_count': ['mean', 'median'],
    'video_like_count': ['mean', 'median'],
    'video_share_count': ['mean', 'median']})

| author_ban_status | video_view_count (mean) | video_view_count (median) | video_like_count (mean) | video_like_count (median) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|
| active | 215927.039524 | 8616.0 | 71036.533836 | 2222.0 | 14111.466164 | 437.0 |
| banned | 445845.439144 | 448201.0 | 153017.236697 | 105573.0 | 29998.942508 | 14468.0 |
| under review | 392204.836399 | 365245.5 | 128718.050339 | 71204.5 | 25774.696999 | 9444.0 |


In [13]:
# Group by author_ban_status and calculate the median of video_share_count
data.groupby(['author_ban_status']).agg({'video_share_count': 'median'})

| author_ban_status | video_share_count |
|---|---|
| active | 437.0 |
| banned | 14468.0 |
| under review | 9444.0 |


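Videos by banned authors are shared far more than videos by active authors: the median share count is 14,468 for banned authors against 437 for active authors, roughly 33 times higher. One possible explanation is that these videos contain more controversial content and therefore attract more engagement.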

In [14]:
# Count, mean, and median of views, likes, and shares for each author ban status
data.groupby(['author_ban_status']).agg({
    'video_view_count': ['count', 'mean', 'median'],
    'video_like_count': ['count', 'mean', 'median'],
    'video_share_count': ['count', 'mean', 'median']})

| author_ban_status | video_view_count (count) | video_view_count (mean) | video_view_count (median) | video_like_count (count) | video_like_count (mean) | video_like_count (median) | video_share_count (count) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|---|---|---|
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |


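Videos by banned and under-review authors have far higher views, likes, and shares than videos by active authors. For active authors the means are far above the medians (for example, a mean view count of about 215,927 against a median of 8,616), which points to a right-skewed distribution with a small number of very highly viewed videos. For banned authors the mean and median view counts are close (about 445,845 and 448,201), although mean likes and shares exceed their medians in every group.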
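
One rough way to quantify the gap between mean and median is their ratio per group: a value well above 1 suggests right skew, while a value near 1 suggests a more symmetric distribution. This is a minimal sketch of that check, not a formal skewness test; the helper name `mean_to_median` and the toy values are illustrative only.

```python
import pandas as pd


def mean_to_median(df: pd.DataFrame, group_col: str, value_cols: list) -> pd.DataFrame:
    """Ratio of mean to median for each value column within each group."""
    grouped = df.groupby(group_col)[value_cols]
    # A group median of 0 would produce inf here; that case is not handled in this sketch.
    return grouped.mean() / grouped.median()


# Toy frame for illustration only (made-up values, not the TikTok data)
toy = pd.DataFrame({
    "author_ban_status": ["active"] * 3 + ["banned"] * 3,
    "video_view_count": [10.0, 20.0, 300.0, 90.0, 100.0, 110.0],
})
ratios = mean_to_median(toy, "author_ban_status", ["video_view_count"])
print(ratios)
```

In the toy frame, the single large value in the `active` group pulls its mean far above its median, while the `banned` group is symmetric.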

In [17]:
# Create likes per view column
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']

# Create comments per view column
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']

# Create shares per view column
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

data.groupby(['claim_status', 'author_ban_status']).agg({
    'likes_per_view': ['count','mean', 'median'],
    'comments_per_view': ['count','mean', 'median'],
    'shares_per_view': ['count','mean', 'median']})

| claim_status | author_ban_status | likes_per_view (count) | likes_per_view (mean) | likes_per_view (median) | comments_per_view (count) | comments_per_view (mean) | comments_per_view (median) | shares_per_view (count) | shares_per_view (mean) | shares_per_view (median) |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| opinion | active | 8817 | 0.219744 | 0.21833 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |


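Earlier, we saw that videos by banned and under-review authors have higher views, likes, and shares than videos by active authors. Once engagement is normalized by view count, however, a video's engagement rate appears to be related more to its `claim_status` than to its `author_ban_status`: within each claim status, the rates differ little across ban statuses.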

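We also saw that claim videos have more views than opinion videos. The per-view rates show that claim videos also receive more likes, comments, and shares per view on average (for example, a mean of roughly 0.33 to 0.35 likes per view, against roughly 0.21 to 0.23 for opinion videos), which suggests that viewers engage with them more readily.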
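
The per-view rates can be wrapped in a small helper that leaves the input frame untouched and avoids division by zero. This is a minimal sketch: the helper name `add_engagement_rates` and the toy values are illustrative, and the zero-view guard is an added assumption (the summary statistics above show a minimum view count of 20, so it would not trigger on this dataset).

```python
import numpy as np
import pandas as pd


def add_engagement_rates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with likes/comments/shares per view added as columns."""
    out = df.copy()
    # Treat zero views as missing so the ratios become NaN instead of inf
    views = out["video_view_count"].replace(0, np.nan)
    for metric in ("like", "comment", "share"):
        out[f"{metric}s_per_view"] = out[f"video_{metric}_count"] / views
    return out


# Toy frame for illustration only (made-up values, not the TikTok data)
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion"],
    "video_view_count": [100, 200, 50, 0],
    "video_like_count": [30, 80, 10, 0],
    "video_comment_count": [1, 2, 0, 0],
    "video_share_count": [5, 10, 1, 0],
})
rates = add_engagement_rates(toy)

# Median engagement rates by claim status alone
rate_cols = ["likes_per_view", "comments_per_view", "shares_per_view"]
medians = rates.groupby("claim_status")[rate_cols].median()
print(medians)
```

Grouping by `claim_status` alone, as in the last step, gives a more direct comparison of the two classes than the two-level grouping.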

# 3. Exploratory Data Analysis

# 4. Data Exploration and Hypothesis Testing

# 5. Regression Modeling

# 6. Classifying Videos with Machine Learning