# **TikTok Project - Exploratory Data Analysis (part 1)**

The data team is still in the early stages of the project. The project manager has notified you that TikTok's leadership team has approved the project proposal. To gain clear insights in preparation for a claims classification model, the data provided by TikTok must be examined to begin the process of exploratory data analysis (EDA). Note: the material in this notebook covers course 2.

### **Inspect and analyze data** ---

Here we will start examining the data provided by the Senior Data Scientist on our team and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:
1.   Acquaint you with the dataset
2.   Compile summary information about the dataset
3.   Begin the process of EDA revealing the content of the dataset
4.   Prepare you for more in-depth EDA (i.e. hypothesis testing, and statistical analysis)

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of my findings.

To start, we import the necessary libraries, then load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

In [7]:
# Import libraries
import pandas as pd
import numpy as np

# Load dataset
data = pd.read_csv("tiktok_dataset.csv")

Let's view and inspect summary information about the dataframe using `.head()`, `.info()` and `.describe()`.

In [8]:
# Display and examine the first 10 rows of the dataframe
print("First 10 rows of the dataset:")
data.head(10)

First 10 rows of the dataset:


Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [9]:
# Get summary information about the dataset
print("\nSummary information about the dataset:")
data.info()


Summary information about the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [10]:
# Get summary statistics of the dataset
print("\nSummary statistics of the dataset:")
data.describe()


Summary statistics of the dataset:


Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


- Let's consider some key questions to familiarize ourselves with this dataset:

    **Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

    **Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

    **Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

- Let's answer them:

    **Answer 1:** The dataframe contains a collection of categorical, text, and numerical data. Each row represents a distinct TikTok video that presents either a claim or an opinion and the accompanying metadata about that video.

    **Answer 2:** The dataframe contains five float64s, three int64s, and four objects. There are 19,382 observations, but some of the variables have missing values, including claim status, the video transcription, and all of the count variables (19,084 non-null values each).

    **Answer 3:** Many of the count variables seem to have outliers at the high end of the distribution. They have very large standard deviations and maximum values that are very high compared to their quartile values.
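The missing values noted in Answer 2 can also be counted directly instead of being read off the `.info()` output. The sketch below uses a small hypothetical stand-in named `sample` (its values are illustrative, not the real file); the same calls should work on `data`. Because every affected column reports the same non-null count of 19,084, the missing values may all sit in the same rows, but that is an assumption worth checking rather than something the output above proves.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the notebook itself loads tiktok_dataset.csv
sample = pd.DataFrame({
    "claim_status": ["claim", "opinion", np.nan, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 140877.0, np.nan, 437506.0],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Number of rows in which both columns are missing at the same time
rows_missing_both = sample[["claim_status", "video_view_count"]].isna().all(axis=1).sum()
print("Rows missing both fields:", rows_missing_both)
```

If the second number equals the per-column counts on the real data, the nulls are confined to one set of rows and could be dropped together.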

Let's start investigating the variables more closely to better understand them...

We know that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable (our target variable). We can begin by determining how many videos there are for each claim status.

In [13]:
# Use value_counts to get the distribution of claim_status
print("\nDistribution of claim_status:")
data['claim_status'].value_counts()


Distribution of claim_status:


claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

... the counts of each claim status are quite balanced! (this will be useful to know down the line, no need to rebalance the dataset)
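To put a number on "quite balanced", the counts can be converted to proportions. This minimal sketch rebuilds the two counts printed above as a Series named `counts` (a name chosen for illustration) and divides by their total.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"claim": 9608, "opinion": 9476}, name="count")

# Share of each class among the labeled videos
shares = counts / counts.sum()
print(shares.round(4))
```

On the full dataframe, `data['claim_status'].value_counts(normalize=True)` should give the same proportions, since `value_counts()` excludes missing values by default.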

Next, examine the engagement trends associated with each different claim status. Let's do this by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [None]:
# Create a boolean mask for claim status
claims_mask = data['claim_status'] == 'claim'
# Use the mask to filter the dataframe
claims = data[claims_mask]
# Calculate and print the mean and median view count for the claim status
print('Mean view count claims:', claims['video_view_count'].mean())
print('Median view count claims:', claims['video_view_count'].median())

Mean view count claims: 501029.4527477102
Median view count claims: 501555.0


In [15]:
# Create a boolean mask for opinion status
opinions_mask = data['claim_status'] == 'opinion'
# Use the mask to filter the dataframe
opinions = data[opinions_mask]
# Calculate and print the mean and median view count for the opinion status
print('Mean view count opinions:', opinions['video_view_count'].mean())
print('Median view count opinions:', opinions['video_view_count'].median())

Mean view count opinions: 4956.43224989447
Median view count opinions: 4953.0


... the mean and the median within each claim category are close to one another, but there is a vast discrepancy (roughly a factor of 100) between view counts for videos labeled as claims and videos labeled as opinions! (this could be useful for prediction? ... we might want to investigate this in a more statistically rigorous way)
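The two Boolean-mask cells can be condensed into a single `groupby()` call, which also makes the claim-to-opinion ratio easy to compute. The sketch below runs on a hypothetical miniature frame named `sample` with made-up values; on the real data the same call would be applied to `data`.

```python
import pandas as pd

# Hypothetical miniature of `data` (illustrative values, not from the real file)
sample = pd.DataFrame({
    "claim_status": ["claim", "claim", "opinion", "opinion", None],
    "video_view_count": [500000.0, 502000.0, 4900.0, 5000.0, None],
})

# groupby drops rows whose key is missing, so the unlabeled row is ignored
view_stats = sample.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
print(view_stats)

# Ratio of the claim mean to the opinion mean
ratio = view_stats.loc["claim", "mean"] / view_stats.loc["opinion", "mean"]
print(f"claim / opinion mean views: {ratio:.1f}x")
```

Masking keeps the intermediate `claims` and `opinions` dataframes around for later use; `groupby()` is shorter when only the summary statistics are needed.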

Next, we can examine trends associated with the ban status of the author. We will use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [18]:
# Get counts for each group combination of claim status and author ban status and select the count column only
data.groupby(['claim_status', 'author_ban_status']).count()[['#']]

Unnamed: 0_level_0,Unnamed: 1_level_0,#
claim_status,author_ban_status,Unnamed: 2_level_1
claim,active,6566
claim,banned,1439
claim,under review,1603
opinion,active,8817
opinion,banned,196
opinion,under review,463


...there are significantly more claim videos from banned authors than opinion videos from banned authors. This could suggest a few things:

- Claim videos might be more closely monitored, leading to more bans.

- Banned authors might be more likely to post claim videos.

- There might be an underlying factor linking claim videos and author bans.

However, this data does not tell us:

- Whether claim videos are actually more likely to violate rules.

- Whether posting a claim video directly causes a ban.

- Whether authors were banned because of these videos.

In short, while we can study patterns involving author status (active vs. banned), we cannot conclude anything about whether specific videos caused bans.
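Raw counts are hard to compare because the groups differ in size, so it can help to express each ban status as a share of its claim status. The following sketch rebuilds the six counts from the `groupby()` output above in a frame named `counts` (an illustrative name) and normalizes each row.

```python
import pandas as pd

# Counts copied from the groupby output above
counts = pd.DataFrame(
    {
        "active": [6566, 8817],
        "banned": [1439, 196],
        "under review": [1603, 463],
    },
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Divide each row by its total to get the share of each ban status per claim status
row_shares = counts.div(counts.sum(axis=1), axis=0)
print(row_shares.round(3))
```

On the full dataframe, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should produce the same table in one step. These shares describe association only; the caveats listed above still apply.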


Let's continue investigating engagement levels, now focusing on `author_ban_status`. Let's calculate the mean and median view, like, and share counts for each author ban status using the `.agg` method, then isolate the median share count.

In [None]:
# Get the mean and median of video view count, like count, and share count for each author ban status (using agg function)
data.groupby(['author_ban_status']).agg(
    {'video_view_count': ['mean', 'median'],
     'video_like_count': ['mean', 'median'],
     'video_share_count': ['mean', 'median']})

Unnamed: 0_level_0,video_view_count,video_view_count,video_like_count,video_like_count,video_share_count,video_share_count
Unnamed: 0_level_1,mean,median,mean,median,mean,median
author_ban_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
active,215927.039524,8616.0,71036.533836,2222.0,14111.466164,437.0
banned,445845.439144,448201.0,153017.236697,105573.0,29998.942508,14468.0
under review,392204.836399,365245.5,128718.050339,71204.5,25774.696999,9444.0


In [21]:
# Get only the median video share count for each author ban status (using median function)
data.groupby(['author_ban_status']).median(numeric_only=True)[['video_share_count']]

Unnamed: 0_level_0,video_share_count
author_ban_status,Unnamed: 1_level_1
active,437.0
banned,14468.0
under review,9444.0


... banned authors have a median share count that's 33 times the median share count of active authors! 
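The factor quoted above can be checked by dividing every group's median by the median for active authors. This small sketch copies the three medians from the output into a Series named `medians` (an illustrative name).

```python
import pandas as pd

# Median share counts copied from the output above
medians = pd.Series(
    {"active": 437.0, "banned": 14468.0, "under review": 9444.0},
    name="video_share_count",
)

# Express each median as a multiple of the active-author median
relative_to_active = medians / medians["active"]
print(relative_to_active.round(1))
```

The same division also shows that authors under review sit at roughly 22 times the active median.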

Let's explore this in more depth:

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

NB: remember, the argument for the `agg()` function is a dictionary whose keys are column names. Each value is a list of the calculations to apply to that column.

In [25]:
# Use groupby and agg to get the count, mean, and median of video view count, like count, and share count for each author ban status
data.groupby(['author_ban_status']).agg(
    {'video_view_count': ['count', 'mean', 'median'],
     'video_like_count': ['count', 'mean', 'median'],
     'video_share_count': ['count', 'mean', 'median']
     })

Unnamed: 0_level_0,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count
Unnamed: 0_level_1,count,mean,median,count,mean,median,count,mean,median
author_ban_status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
active,15383,215927.039524,8616.0,15383,71036.533836,2222.0,15383,14111.466164,437.0
banned,1635,445845.439144,448201.0,1635,153017.236697,105573.0,1635,29998.942508,14468.0
under review,2066,392204.836399,365245.5,2066,128718.050339,71204.5,2066,25774.696999,9444.0


... a few observations stand out:
* Banned authors and those under review get far more views, likes, and shares than active authors.
* In most groups, the mean is greater than the median, most dramatically for active authors, which indicates a right-skewed distribution in which some videos have very high engagement counts.
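One rough way to see how strongly the mean is pulled above the median is to take their ratio per group. The sketch below does this for view counts only, using the values printed in the table above; the column name `mean_to_median` is introduced here for illustration.

```python
import pandas as pd

# Mean and median view counts copied from the agg output above
view_stats = pd.DataFrame(
    {
        "mean": [215927.039524, 445845.439144, 392204.836399],
        "median": [8616.0, 448201.0, 365245.5],
    },
    index=pd.Index(["active", "banned", "under review"], name="author_ban_status"),
)

# A ratio well above 1 suggests a long right tail; a ratio near 1 suggests little skew
view_stats["mean_to_median"] = view_stats["mean"] / view_stats["median"]
print(view_stats.round(2))
```

For view counts the gap is concentrated in the active group, while the banned and under-review groups have means close to their medians.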

Now, let's create three new columns to help better understand engagement rates (normalize per view):
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [None]:
# Create a likes_per_view column
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
# Create a comments_per_view column
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
# Create a shares_per_view column
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']
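Before aggregating ratio columns like these, it is worth knowing how pandas handles awkward rows. The sketch below uses a hypothetical three-row frame named `sample`: one complete row, one with missing counts, and one with zero views. The zero-view row is purely illustrative, since the `describe()` output above reports a minimum view count of 20.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, one with missing counts, one with zero views
sample = pd.DataFrame({
    "video_like_count": [19425.0, np.nan, 10.0],
    "video_view_count": [343296.0, np.nan, 0.0],
})

sample["likes_per_view"] = sample["video_like_count"] / sample["video_view_count"]
print(sample)

# Missing inputs give NaN; a zero denominator gives inf rather than raising an error
n_missing = sample["likes_per_view"].isna().sum()
n_infinite = np.isinf(sample["likes_per_view"]).sum()
print("NaN ratios:", n_missing, "| infinite ratios:", n_infinite)
```

Aggregations such as `mean` and `median` skip NaN by default, which is why rows with missing counts do not break the grouped statistics.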

Now let's use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [27]:
# Use groupby and agg to get the count, mean, and median of likes_per_view, comments_per_view, and shares_per_view for each claim status and author ban status
data.groupby(['claim_status', 'author_ban_status']).agg(
    {'likes_per_view': ['count', 'mean', 'median'],
     'comments_per_view': ['count', 'mean', 'median'],
     'shares_per_view': ['count', 'mean', 'median']})

Unnamed: 0_level_0,Unnamed: 1_level_0,likes_per_view,likes_per_view,likes_per_view,comments_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view,shares_per_view
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,median,count,mean,median,count,mean,median
claim_status,author_ban_status,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
claim,active,6566,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279
claim,banned,1439,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606
claim,under review,1603,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967
opinion,active,8817,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405
opinion,banned,196,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728
opinion,under review,463,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027


...key insights from engagement metrics:

- Claim videos outperform opinion videos in engagement across the board: Higher likes per view, comments per view, and shares per view. This suggests claim videos are more engaging regardless of author status.

- Author ban status has limited impact on engagement: Within claim videos, banned authors actually show slightly higher likes and shares per view than active authors. For opinion videos, banned authors consistently show the lowest engagement across all metrics.

Overall interpretation:

1) Engagement is driven more by content type (claim vs opinion) than by whether the author is banned.
2) Claim videos, in general, attract more attention and interaction.
3) The lower engagement from banned authors on opinion videos may reflect content quality, trust issues, or moderation dynamics.
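The size of these effects can be made concrete with the mean `likes_per_view` values from the table above. This sketch copies them into a frame named `mean_likes_per_view` (an illustrative name), then compares ban statuses against active authors and claims against opinions.

```python
import pandas as pd

# Mean likes_per_view copied from the agg output above
mean_likes_per_view = pd.DataFrame(
    {
        "active": [0.329542, 0.219744],
        "banned": [0.345071, 0.206868],
        "under review": [0.327997, 0.226394],
    },
    index=pd.Index(["claim", "opinion"], name="claim_status"),
)

# Percentage difference of each ban status relative to active authors
pct_vs_active = (mean_likes_per_view.div(mean_likes_per_view["active"], axis=0) - 1) * 100
print(pct_vs_active.round(1))

# Claim-to-opinion ratio within each ban status
claim_to_opinion = mean_likes_per_view.loc["claim"] / mean_likes_per_view.loc["opinion"]
print(claim_to_opinion.round(2))
```

Differences between ban statuses stay within a few percent, whereas claims exceed opinions by roughly half or more in every ban status, which is consistent with the interpretation above.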

### **How can we summarize our findings for the TikTok data team lead?** ---

**Questions:**

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?

**Answers:**

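* Of the 19,382 samples in this dataset, 9,608 are claims (49.6%) and 9,476 are opinions (48.9%); the remaining 298 rows have no claim status. Among the 19,084 labeled videos, the split is 50.3% claims to 49.7% opinions.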
* Engagement level is strongly correlated with claim status. This should be a focus of further inquiry.
* Videos with banned authors have much higher raw engagement counts (views, likes, shares) than videos with active authors. Videos with authors under review fall between these two categories. Per-view engagement rates, by contrast, differ mainly by claim status rather than by ban status.