# **TikTok Project**

# **Inspect and analyze data**

**The purpose** of this notebook is to investigate and understand the data provided. 

**The goal** is to perform a cursory inspection of the provided dataset and inform TikTok data team members of your findings.

# **Identify data types and compile summary information**


# **PACE stages**

## **PACE: Plan**

### **Understand the situation**

*   How can the provided information be best prepared for better understanding and organization?


    To prepare the provided information for understanding and organization, follow these steps:

    1. Review the dataset: Get familiar with the structure of the data, including the columns, data types, and any potential missing values.
    
    2. Identify key questions: Understand the objectives of the project, the problem you're solving, and what insights the team is looking for from the data.
    
    3. Conduct an initial assessment: Perform some basic descriptive statistics (e.g., mean, median, standard deviation) and visualize the data to identify trends, anomalies, and patterns.
    
    4. Organize data into categories: Categorize the data into relevant sections based on the types of analysis you plan to perform (e.g., numeric vs. categorical data).
    
    5. Document observations: Take notes on any peculiarities, such as missing values, outliers, or inconsistencies in the data.
    
    6. Determine next steps: Decide on any further data cleaning, transformation, or feature engineering needed before proceeding to exploratory data analysis (EDA).
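The first few steps above (reviewing structure, checking for missing values, computing basic statistics, and separating numeric from categorical columns) can be sketched in pandas as follows. The `sample` dataframe is a made-up stand-in for the real dataset; its values are invented, and the column names are borrowed from the TikTok data only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values invented for illustration)
sample = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 4956.0, np.nan, 437506.0],
})

# Step 1: review structure and data types
column_types = sample.dtypes

# Step 1 (continued): count missing values per column
missing_per_column = sample.isna().sum()

# Step 3: basic descriptive statistics for the numeric columns
numeric_summary = sample.describe()

# Step 4: separate numeric and categorical columns
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(column_types)
print(missing_per_column)
print(numeric_summary)
print(numeric_cols, categorical_cols)
```

The same calls should work unchanged on the full dataframe once it is loaded.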

## **PACE: Analyze**

### **Imports and data loading**

In [8]:
# Import packages
import pandas as pd
import numpy as np

In [9]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Understand the data - Inspect the data**

In [10]:
# Display and examine the first ten rows of the dataframe
data.head(10)

|   | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [6]:
# Get summary info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [11]:
# Get summary statistics
data.describe()

|   | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


Question 1: When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

    Each row represents a single TikTok video in which the author either makes a claim or states an opinion. The claim_status column labels the video as a claim or an opinion, and video_id is a unique identifier for each video. The video_duration_sec column gives the length of the video in seconds, and video_transcription_text holds the transcribed text of the video (truncated in the display above).

    Other columns include verified_status, which appears to describe whether the author's account is verified rather than whether the claim itself has been checked, and author_ban_status, which reflects the ban status of the author (active, under review, or banned). The dataframe also includes engagement metrics (video_view_count, video_like_count, video_share_count, video_download_count, and video_comment_count), which show the level of interaction with each video.
    
    
Question 2: When reviewing the data.info() output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

    The data.info() output shows that not all variables are numeric. The dataframe contains three data types: int64, float64, and object. The columns claim_status, video_transcription_text, verified_status, and author_ban_status are of type object, while the rest are numeric: int64 for the row number (#), video_id, and video_duration_sec, and float64 for the five engagement count columns.

    There are null values in the data. The claim_status, video_transcription_text, and all five engagement columns (video_view_count, video_like_count, video_share_count, video_download_count, video_comment_count) each have 19,084 non-null values out of 19,382 entries, so 298 values are missing in each. The identical counts suggest that the same 298 rows are incomplete across these columns, although this should be confirmed.

    The dataframe holds 19,382 entries and uses a little over 1.8 MB of memory. The missing values in the label and engagement columns will need to be addressed during data cleaning or preprocessing.
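The number of missing values can be read off the `data.info()` output by subtracting the non-null count from the number of entries. A minimal sketch of that arithmetic, using the figures printed above (the variable names are illustrative); on the loaded dataframe, `data.isna().sum()` should report the same per-column counts directly.

```python
# Figures taken from the data.info() output above
total_rows = 19382
non_null_rows = 19084  # claim_status, transcription text, and the five count columns

# Missing values per affected column, and their share of all rows
missing_rows = total_rows - non_null_rows
missing_share = missing_rows / total_rows

print(f"Missing: {missing_rows} rows ({missing_share:.2%})")
```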
    
    
Question 3: When reviewing the data.describe() output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

    Upon reviewing the data.describe() output, several observations can be made regarding the distributions of the variables:

    The video duration (video_duration_sec) ranges from a minimum of 5 seconds to a maximum of 60 seconds.

    The video_view_count column is highly skewed: the mean is 254,708 but the median is only 9,954.5, so a subset of very popular videos drives up the average. The maximum of 999,817 is about 100 times the median.

    The video_like_count also exhibits skewness, with a mean of 84,304 and a median of 3,403.5, meaning most videos receive relatively few likes while a smaller group accumulates very high engagement.

    The video_share_count follows a similar pattern, with a median of 717 compared to a mean of 16,735, showing that a small number of videos receive a disproportionate number of shares.

    The video_download_count and video_comment_count columns are similarly skewed: downloads have a median of 46 but a mean of 1,049, and comments have a median of 9 but a mean of 349.

    The minimum of zero in video_like_count, video_share_count, video_download_count, and video_comment_count shows that some videos receive no interaction of that kind at all.
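One way to make the outlier question more concrete is to compare each maximum with an upper fence computed from the quartiles. The sketch below applies Tukey's 1.5 × IQR rule, which is an assumed convention rather than something the notebook prescribes, to the quartiles and maxima copied from the `data.describe()` output above.

```python
import pandas as pd

# Quartiles and maxima copied from the data.describe() output above
summary = pd.DataFrame(
    {
        "q1": [4942.5, 810.75, 115.0, 7.0, 1.0],
        "q3": [504327.0, 125020.0, 18222.0, 1156.25, 292.0],
        "max": [999817.0, 657830.0, 256130.0, 14994.0, 9599.0],
    },
    index=[
        "video_view_count",
        "video_like_count",
        "video_share_count",
        "video_download_count",
        "video_comment_count",
    ],
)

# Tukey's rule (an assumed convention): values above Q3 + 1.5 * IQR are flagged
summary["iqr"] = summary["q3"] - summary["q1"]
summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]
summary["max_exceeds_fence"] = summary["max"] > summary["upper_fence"]

print(summary[["upper_fence", "max", "max_exceeds_fence"]])
```

Under this rule the view-count maximum falls inside the fence, because the interquartile range is itself very wide, while the maxima of likes, shares, downloads, and comments all exceed their fences.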


### **Understand the data - Investigate the variables**

Let's investigate the variables more closely to better understand them.
A good first step is to examine the `claim_status` variable. Let's determine how many videos there are for each claim status.

In [12]:
# What are the different values for claim status and how many of each are in the data?
data.claim_status.value_counts()

claim_status
claim      9608
opinion    9476
Name: count, dtype: int64

The values for the claim_status variable indicate that the dataset is nearly balanced, with 9,608 videos classified as "claim" and 9,476 as "opinion." This near-equal distribution suggests that the dataset is well-suited for training a classification model without significant class imbalance issues.
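To express the balance as proportions, the counts can be divided by their total. The sketch below rebuilds the counts from the output above; on the loaded dataframe, `data["claim_status"].value_counts(normalize=True)` should give the same shares, since `value_counts` leaves out missing values by default.

```python
import pandas as pd

# Counts copied from the value_counts() output above
claim_counts = pd.Series({"claim": 9608, "opinion": 9476}, name="count")

# Share of each class among rows that have a claim status
class_share = claim_counts / claim_counts.sum()

print(class_share.round(4))
```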

Next, let's examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [13]:
# What are the mean and median view counts of videos with "claim" status?
claim_mask = data["claim_status"] == "claim"
avg_claim_views = data.loc[claim_mask, "video_view_count"].mean()
median_claim_views = data.loc[claim_mask, "video_view_count"].median()

print(f"Average: {avg_claim_views}, Median: {median_claim_views}")

Average: 501029.4527477102, Median: 501555.0


In [14]:
# What are the mean and median view counts of videos with "opinion" status?
opinion_mask = data["claim_status"] == "opinion"
avg_opinion_views = data.loc[opinion_mask, "video_view_count"].mean()
median_opinion_views = data.loc[opinion_mask, "video_view_count"].median()

print(f"Average: {avg_opinion_views}, Median: {median_opinion_views}")

Average: 4956.43224989447, Median: 4953.0


Within each claim category, the mean and median view counts are very close, which suggests that the view counts are roughly symmetric within each group. The strong skew seen in the overall view counts therefore comes mainly from combining two groups with very different typical values: the median claim video has about 100 times as many views as the median opinion video.

**For "claim" videos:**

The mean (501,029.45) is slightly lower than the median (501,555.0).
A mean below the median points, if anything, to a slight negative skew, where a few lower view counts pull the mean down. The gap is only about 0.1% of the median, so the distribution is close to symmetric.

**For "opinion" videos:**

The mean (4,956.43) and median (4,953.0) are also very close to each other.
This suggests that the distribution of view counts for "opinion" videos is fairly symmetric, without extreme outliers affecting the mean significantly.
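The two Boolean-mask cells above can also be condensed into a single `groupby()` call that returns the mean and median for every claim status at once. The sketch below uses a small invented dataframe (`toy`) so that it runs on its own; the values are not from the TikTok data.

```python
import pandas as pd

# Hypothetical miniature dataset (values invented for illustration)
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400000.0, 500000.0, 600000.0, 4000.0, 5000.0, 9000.0],
})

# One groupby call replaces the separate mask-and-aggregate cells
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])

# A mean above the median hints at a right tail; a mean below it hints at a left tail
view_stats["mean_minus_median"] = view_stats["mean"] - view_stats["median"]

print(view_stats)
```

On the real data, rows with a missing `claim_status` are left out of the groups, because `groupby()` drops missing keys by default.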


Now, let's examine trends associated with the ban status of the author.

In [15]:
# Get counts for each group combination of claim status and author ban status
group_counts = data.groupby(['claim_status', 'author_ban_status']).size().reset_index(name='count')
group_counts

|   | claim_status | author_ban_status | count |
|---|---|---|---|
| 0 | claim | active | 6566 |
| 1 | claim | banned | 1439 |
| 2 | claim | under review | 1603 |
| 3 | opinion | active | 8817 |
| 4 | opinion | banned | 196 |
| 5 | opinion | under review | 463 |


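There are many more "claim" videos from banned authors (1,439, about 15% of all claim videos) than "opinion" videos from banned authors (196, about 2% of all opinion videos). The same pattern holds for authors under review (1,603 versus 463). This could be because "claim" videos are more likely to violate platform policies or attract reports, leading to author bans, whereas "opinion" videos may express personal views without causing issues. The counts alone cannot confirm this explanation.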
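Because the two claim statuses have different totals, row-wise proportions make the comparison easier to read than raw counts. A minimal sketch, rebuilding the table from the counts printed above and dividing each row by its total:

```python
import pandas as pd

# Counts copied from the group_counts output above
group_counts = pd.DataFrame({
    "claim_status": ["claim"] * 3 + ["opinion"] * 3,
    "author_ban_status": ["active", "banned", "under review"] * 2,
    "count": [6566, 1439, 1603, 8817, 196, 463],
})

# Pivot to one row per claim status, then divide each row by its total
wide = group_counts.pivot(index="claim_status", columns="author_ban_status", values="count")
row_share = wide.div(wide.sum(axis=1), axis=0)

print(row_share.round(3))
```

On the loaded dataframe, `pd.crosstab(data["claim_status"], data["author_ban_status"], normalize="index")` is an alternative that should produce the same proportions in one step.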

Now let's focus on `author_ban_status` and calculate the median video share count for each author ban status.

In [16]:
# What is the median video share count for each author ban status?
data.groupby(['author_ban_status'])["video_share_count"].median()

author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64

The median video share count for videos from "banned" authors (14,468) is about 33 times the median for "active" authors (437), and videos from authors "under review" sit in between (9,444). Videos from banned authors are therefore shared far more often than those from active authors.

This could indicate that videos from banned authors are more controversial or attract more attention, leading to increased sharing. The data alone cannot show why these videos are shared more, or whether the high engagement came before or after the ban.

Active authors, on the other hand, may produce content that is more aligned with platform guidelines or is less contentious, leading to fewer shares. Further investigation into the content and context of these videos might provide a deeper understanding of this behavior.

In [17]:
# Count, mean, and median of views, likes, and shares for each author ban status
data.groupby(["author_ban_status"]).agg({'video_view_count': ['count', 'mean', 'median'],
                                         'video_like_count': ['count', 'mean', 'median'],
                                         'video_share_count': ['count', 'mean', 'median']})

| author_ban_status | video_view_count (count) | video_view_count (mean) | video_view_count (median) | video_like_count (count) | video_like_count (mean) | video_like_count (median) | video_share_count (count) | video_share_count (mean) | video_share_count (median) |
|---|---|---|---|---|---|---|---|---|---|
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?

Banned authors have substantially higher views, likes, and shares than active authors. Here's a closer look:

Views: Banned authors have a much higher mean (445,845) and median (448,201) view count than active authors, whose mean is 215,927 and median is 8,616. For active authors the mean is far above the median, so most of their videos get few views while a minority get very many.

Likes: Similarly, the mean and median for likes are higher for banned authors (mean 153,017, median 105,573) than for active authors (mean 71,037, median 2,222).

Shares: The share count for banned authors is also considerably higher, with a mean of 29,999 and a median of 14,468, compared to a mean of 14,111 and a median of 437 for active authors.

In conclusion, videos from banned authors attract much more attention than videos from active authors, possibly because of their controversial or attention-grabbing nature. A likely contributing factor is claim status: 1,439 of the 1,635 videos from banned authors (about 88%) are claims, compared with 6,566 of 15,383 (about 43%) for active authors, and claim videos have far higher view counts than opinion videos.


Now, let's create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [18]:
# Create a likes_per_view column
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']

# Create a comments_per_view column
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']

# Create a shares_per_view column
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

We will use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [19]:
data.groupby(["author_ban_status", "claim_status"]).agg({'likes_per_view': ['count', 'mean', 'median'],
                                                         'comments_per_view': ['count', 'mean', 'median'],
                                                         'shares_per_view': ['count', 'mean', 'median']})

| author_ban_status | claim_status | likes_per_view (count) | likes_per_view (mean) | likes_per_view (median) | comments_per_view (count) | comments_per_view (mean) | comments_per_view (median) | shares_per_view (count) | shares_per_view (mean) | shares_per_view (median) |
|---|---|---|---|---|---|---|---|---|---|---|
| active | claim | 6566 | 0.329542 | 0.326538 | 6566 | 0.001393 | 0.000776 | 6566 | 0.065456 | 0.049279 |
| active | opinion | 8817 | 0.219744 | 0.21833 | 8817 | 0.000517 | 0.000252 | 8817 | 0.043729 | 0.032405 |
| banned | claim | 1439 | 0.345071 | 0.358909 | 1439 | 0.001377 | 0.000746 | 1439 | 0.067893 | 0.051606 |
| banned | opinion | 196 | 0.206868 | 0.198483 | 196 | 0.000434 | 0.000193 | 196 | 0.040531 | 0.030728 |
| under review | claim | 1603 | 0.327997 | 0.320867 | 1603 | 0.001367 | 0.000789 | 1603 | 0.065733 | 0.049967 |
| under review | opinion | 463 | 0.226394 | 0.228051 | 463 | 0.000536 | 0.000293 | 463 | 0.044472 | 0.035027 |


How does the data for claim videos and opinion videos compare or differ? 

**Likes per View:**

Claim videos have a higher mean number of likes per view (around 0.33) than opinion videos (around 0.22), so viewers of claim videos are more likely to like them.

**Comments per View:**

Comments per view are also higher for claim videos (mean around 0.0014) than for opinion videos (mean around 0.0005). Comments are rare in both groups, at roughly one per thousand views or fewer, but claim videos draw nearly three times as many per view.

**Shares per View:**

Claim videos have a higher mean number of shares per view (around 0.065) than opinion videos (around 0.04), so claim videos are more likely to be shared by viewers.

Within each claim status, the per-view rates are similar across author ban statuses. For example, mean likes per view for claim videos range only from about 0.33 to 0.35 across active, banned, and under-review authors. The per-view engagement rates therefore track claim status much more closely than ban status.
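The "around 0.33" and "around 0.22" figures can be checked by pooling the group means from the table above, weighting each by its group count. This is a sketch of that weighted-mean arithmetic using only the printed counts and means; the column names `mean_likes_per_view` and `weighted` are introduced here for illustration.

```python
import pandas as pd

# Group counts and mean likes_per_view copied from the table above
groups = pd.DataFrame({
    "claim_status": ["claim", "opinion"] * 3,
    "author_ban_status": ["active", "active", "banned", "banned",
                          "under review", "under review"],
    "count": [6566, 8817, 1439, 196, 1603, 463],
    "mean_likes_per_view": [0.329542, 0.219744, 0.345071, 0.206868, 0.327997, 0.226394],
})

# Pooled mean per claim status = sum(count * group mean) / sum(count)
groups["weighted"] = groups["count"] * groups["mean_likes_per_view"]
totals = groups.groupby("claim_status")[["weighted", "count"]].sum()
pooled_mean = totals["weighted"] / totals["count"]

print(pooled_mean.round(3))
```

Medians cannot be pooled this way; they would need to be recomputed from the row-level data, for example by grouping on `claim_status` alone.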


## **PACE: Execute**

### **Summary for the TikTok data team**

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?

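    Claims vs. Opinions: Claims make up about 50.3% and opinions about 49.7% of the 19,084 videos with a recorded claim status (9,608 claims and 9,476 opinions). A further 298 rows have no claim status.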

*   What factors correlate with a video's claim status?

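    Factors Correlating with Claim Status: Claim videos have much higher view counts (median 501,555 versus 4,953 for opinions), higher likes, comments, and shares per view, and a larger share of banned or under-review authors. Content type may also play a role, but it was not examined here.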

*   What factors correlate with a video's engagement level?

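    Factors Correlating with Engagement: Claim status stands out, since claim videos have higher view counts and higher per-view like, comment, and share rates. Author ban status also correlates with engagement in absolute terms (videos from banned and under-review authors have much higher median views, likes, and shares), although per-view rates are similar across ban statuses within each claim status.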
