# TikTok - Preliminary Data Summary

This work prepares a dataset for future exploratory data analysis (EDA). The purpose is to inspect and understand the data provided, using Python dataframes for a cursory inspection and reporting the findings to team members. By the end of this stage, the data should be ready to answer questions, yield insights, support visualizations, and undergo hypothesis testing and other statistical methods.

In [42]:
import pandas as pd
import numpy as np

In [43]:
# reading the data
df = pd.read_csv("tiktok_dataset.csv")
df.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


The dataset has 19382 rows and contains numerical, categorical and text data. Each row represents a distinct TikTok video that presents either a claim or an opinion, together with its metadata. Seven variables have missing values (19084 non-null entries each, so 298 missing): the claim status, the video transcription, and the five engagement counts (views, likes, shares, downloads and comments).
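The missing values noted above can be tallied per column. The following is a minimal sketch, assuming a hypothetical helper `missing_summary` (not part of the original notebook) and a small toy frame standing in for `tiktok_dataset.csv`:

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values for each column with gaps."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share_missing": n_missing / len(frame),
    })
    # Keep only the columns that actually have missing values.
    return summary[summary["n_missing"] > 0]


# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 140877.0, np.nan, 437506.0],
})
print(missing_summary(toy))
```

Applied to the real dataframe as `missing_summary(df)`, this would be expected to list the seven affected columns with 298 missing values each (19382 - 19084), based on the `df.info()` output.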

In [45]:
df.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


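We observe that some of the count variables have outliers at the upper end of their distribution. Their standard deviations and maximum values are very large compared to their quartile values. For example, `video_comment_count` has a median of 9 and a 75th percentile of 292, but a mean of about 349 and a maximum of 9599.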
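One common way to put a number on "outliers at the upper end" is the interquartile-range (IQR) rule. The sketch below is an assumption on our part, not a method used in the original notebook: `count_upper_outliers` is a hypothetical helper, and the multiplier `k = 1.5` is only the conventional default.

```python
import pandas as pd


def count_upper_outliers(values: pd.Series, k: float = 1.5) -> int:
    """Count values above Q3 + k * IQR (a conventional rule of thumb, assumed here)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    # Comparisons with NaN are False, so missing values are never counted.
    return int((values > upper_fence).sum())


# Toy series: most values are small, one is extreme (illustrative only).
toy_counts = pd.Series([1.0, 7.0, 9.0, 46.0, 52.0, 9599.0, None])
print(count_upper_outliers(toy_counts))
```

Using the quartiles reported by `df.describe()` for `video_comment_count` (Q1 = 1, Q3 = 292), the upper fence under this rule would be 292 + 1.5 × 291 = 728.5, far below the maximum of 9599.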

In [46]:
# determining what are the different values for claim status and how many of each are in the data
df["claim_status"].value_counts()

| claim_status | count |
|---|---|
| claim | 9608 |
| opinion | 9476 |


In [48]:
# finding the videos that are marked as claims
claims = df[df["claim_status"] == "claim"]

# finding the videos that are marked as opinions
opinions = df[df["claim_status"] == "opinion"]

# calculating statistics for view count and duration of videos
print("----- Videos marked as claims -----")
print("Avg view count:", claims["video_view_count"].mean())
print("Median view count:", claims["video_view_count"].median())
print("Max view count:", claims["video_view_count"].max())
print("Min view count:", claims["video_view_count"].min())
print("Avg duration (sec):", claims["video_duration_sec"].mean())
print("Median duration (sec):", claims["video_duration_sec"].median())
print("Max duration (sec):", claims["video_duration_sec"].max())
print("Min duration (sec):", claims["video_duration_sec"].min())
print("\n----- Videos marked as opinions -----")
print("Avg view count:", opinions["video_view_count"].mean())
print("Median view count:", opinions["video_view_count"].median())
print("Max view count:", opinions["video_view_count"].max())
print("Min view count:", opinions["video_view_count"].min())
print("Avg duration (sec):", opinions["video_duration_sec"].mean())
print("Median duration (sec):", opinions["video_duration_sec"].median())
print("Max duration (sec):", opinions["video_duration_sec"].max())
print("Min duration (sec):", opinions["video_duration_sec"].min())

----- Videos marked as claims -----
Avg view count: 501029.4527477102
Median view count: 501555.0
Max view count: 999817.0
Min view count: 1049.0
Avg duration (sec): 32.48688592839301
Median duration (sec): 32.0
Max duration (sec): 60
Min duration (sec): 5

----- Videos marked as opinions -----
Avg view count: 4956.43224989447
Median view count: 4953.0
Max view count: 9998.0
Min view count: 20.0
Avg duration (sec): 32.359856479527224
Median duration (sec): 32.0
Max duration (sec): 60
Min duration (sec): 5
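The repeated `print` calls above can be condensed into a single grouped aggregation. This is a minimal sketch of that alternative, run on toy rows rather than the real dataset; the column names match the notebook, but the values are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion", None],
    "video_view_count": [343296.0, 140877.0, 4953.0, 20.0, None],
    "video_duration_sec": [59, 32, 31, 25, 19],
})

# One row per claim status, one column per (variable, statistic) pair.
# Rows with a missing claim_status are dropped by groupby by default.
summary = toy.groupby("claim_status")[["video_view_count", "video_duration_sec"]].agg(
    ["mean", "median", "max", "min"]
)
print(summary)
```

The result has a two-level column index, so a single statistic can be read with, for example, `summary.loc["claim", ("video_view_count", "median")]`.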


In [36]:
# getting counts for each group combination of claim status and author ban status
lst_columns = ["claim_status", "author_ban_status"]
df.groupby(lst_columns).count()[['#']]

| claim_status | author_ban_status | # |
|---|---|---|
| claim | active | 6566 |
| claim | banned | 1439 |
| claim | under review | 1603 |
| opinion | active | 8817 |
| opinion | banned | 196 |
| opinion | under review | 463 |


We see that the two claim categories are balanced in size (9608 videos labeled as claims and 9476 labeled as opinions). Within each category, the mean and median view counts are close to one another. Between the categories, however, there is a vast discrepancy: claim videos have a median of 501555 views, against 4953 for opinion videos. Additionally, there are many more claim videos with banned authors (1439) than opinion videos with banned authors (196). This may suggest that claim videos are policed more strictly than opinion videos, and that authors must comply with a stricter set of rules when they post a claim than when they post an opinion.
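To make the comparison of ban statuses independent of group size, the counts can be turned into shares within each claim status. The sketch below re-enters the counts from the grouped output above by hand; on the real dataframe, something like `pd.crosstab(df["claim_status"], df["author_ban_status"], normalize="index")` should give the same shares.

```python
import pandas as pd

# Counts copied from the grouped output above (claim_status x author_ban_status).
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status per claim status.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

On these counts, roughly 15% of claim videos come from banned authors, compared with about 2% of opinion videos.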

In [37]:
df.groupby(["author_ban_status"]).agg( {"video_view_count": ["mean", "median"],
                                        "video_like_count": ["mean", "median"],
                                        "video_share_count": ["mean", "median"]} )

| author_ban_status | video_view_count (mean) | video_view_count (median) | video_like_count (mean) | video_like_count (median) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|
| active | 215927.039524 | 8616.0 | 71036.533836 | 2222.0 | 14111.466164 | 437.0 |
| banned | 445845.439144 | 448201.0 | 153017.236697 | 105573.0 | 29998.942508 | 14468.0 |
| under review | 392204.836399 | 365245.5 | 128718.050339 | 71204.5 | 25774.696999 | 9444.0 |


In [38]:
# calculating the median video share count of each author ban status
df.groupby(["author_ban_status"]).median(numeric_only=True)[["video_share_count"]]

| author_ban_status | video_share_count |
|---|---|
| active | 437.0 |
| banned | 14468.0 |
| under review | 9444.0 |


In [39]:
df.groupby(["author_ban_status"]).agg( {"video_view_count": ["count", "mean", "median"],
                                        "video_like_count": ["count", "mean", "median"],
                                        "video_share_count": ["count", "mean", "median"]} )

| author_ban_status | video_view_count (count) | video_view_count (mean) | video_view_count (median) | video_like_count (count) | video_like_count (mean) | video_like_count (median) | video_share_count (count) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|---|---|---|
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |


It appears that banned authors have a median share count (14468) that is approximately 33 times the median share count of active authors (437). Furthermore, banned authors and those under review get far more views, likes and shares than active authors. In nearly every group the mean exceeds the median, most dramatically for active authors (for example, a mean of about 215927 views against a median of 8616), which indicates right-skewed distributions in which some videos have very high engagement counts.
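The arithmetic behind these comparisons can be checked directly. The sketch below re-enters the share-count statistics from the aggregated output above by hand; the mean-to-median ratio is used here only as a rough, assumed indicator of right skew.

```python
import pandas as pd

# Mean and median share counts copied from the aggregated output above.
share_stats = pd.DataFrame(
    {"mean": [14111.466164, 29998.942508, 25774.696999],
     "median": [437.0, 14468.0, 9444.0]},
    index=pd.Index(["active", "banned", "under review"], name="author_ban_status"),
)

# Ratio of the banned median to the active median.
banned_vs_active = share_stats.loc["banned", "median"] / share_stats.loc["active", "median"]
print(round(banned_vs_active, 1))

# Mean-to-median ratio per group; values well above 1 suggest right skew.
mean_to_median = share_stats["mean"] / share_stats["median"]
print(mean_to_median.round(1))
```

The first ratio comes out at about 33, matching the figure quoted above. The mean-to-median ratio is largest by far for active authors, where the mean share count is roughly 32 times the median.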

In [40]:
# creating a column for the amount of likes per view
df["likes_per_view"] = df["video_like_count"] / df["video_view_count"]

# creating a column for the amount of comments per view
df["comments_per_view"] = df["video_comment_count"] / df["video_view_count"]

# creating a column for the amount of shares per view
df["shares_per_view"] = df["video_share_count"] / df["video_view_count"]

In [41]:
df.groupby(lst_columns).agg( {"likes_per_view": ["count", "mean", "median"],
                              "comments_per_view": ["count", "mean", "median"],
                              "shares_per_view": ["count", "mean", "median"]} )

| claim_status | author_ban_status | likes_per_view (count) | likes_per_view (mean) | likes_per_view (median) | comments_per_view (count) | comments_per_view (mean) | comments_per_view (median) | shares_per_view (count) | shares_per_view (mean) | shares_per_view (median) |
|---|---|---|---|---|---|---|---|---|---|---|
| claim | active | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| claim | banned | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| claim | under review | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| opinion | active | 8817 | 0.219744 | 0.21833 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| opinion | banned | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| opinion | under review | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |


As seen earlier, videos by banned authors or authors under review tend to receive far more views, likes and shares than videos by active authors. However, once a video is viewed, its engagement rate depends less on the author's ban status and more on the video's claim status. Claim videos not only receive more views than opinion videos; they also have a higher rate of likes per view, suggesting that they are more positively received, as well as higher rates of comments per view and shares per view. More specifically, for claim videos, content by banned authors shows slightly higher likes per view and shares per view than content by active authors or those under review. For opinion videos, on the other hand, active authors and those under review have higher engagement rates than banned authors on all three measures.
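The statement that claim status matters more than ban status can be checked by comparing the spread within each claim status to the gap between claim statuses. The sketch below does this for median likes per view only, re-entering the values from the grouped output above by hand; the names `within_spread` and `between_gap` are illustrative.

```python
import pandas as pd

# Median likes per view copied from the grouped output above.
median_likes_per_view = pd.DataFrame({
    "claim_status": ["claim"] * 3 + ["opinion"] * 3,
    "author_ban_status": ["active", "banned", "under review"] * 2,
    "likes_per_view": [0.326538, 0.358909, 0.320867, 0.21833, 0.198483, 0.228051],
})

# Reshape so rows are claim statuses and columns are ban statuses.
wide = median_likes_per_view.pivot(
    index="claim_status", columns="author_ban_status", values="likes_per_view"
)

# Spread across ban statuses within each claim status ...
within_spread = wide.max(axis=1) - wide.min(axis=1)
# ... versus the gap between the two claim statuses for each ban status.
between_gap = wide.loc["claim"] - wide.loc["opinion"]
print(within_spread.round(3))
print(between_gap.round(3))
```

On these values, the smallest gap between claim and opinion videos (about 0.09, for authors under review) is still larger than the largest spread across ban statuses within a claim status (about 0.04), which is consistent with the reading above.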