# **TikTok Project**
**Course 2 - Get Started with Python**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>

To complete the activity, follow the instructions and answer the questions below. Then, you will use your responses to these questions and the questions included in the Course 2 PACE Strategy Document to create an executive summary.

Be sure to complete this activity before moving on to Course 3. You can assess your work by comparing the results to a completed exemplar after completing the end-of-course project.

# **Identify data types and compile summary information**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

# **PACE stages**

<img src="images/Pace.png" width="100" height="100" align=left>

   *        [Plan](#scrollTo=psz51YkZVwtN&line=3&uniqifier=1)
   *        [Analyze](#scrollTo=mA7Mz_SnI8km&line=4&uniqifier=1)
   *        [Construct](#scrollTo=Lca9c8XON8lc&line=2&uniqifier=1)
   *        [Execute](#scrollTo=401PgchTPr4E&line=2&uniqifier=1)

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:



### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

==> ENTER YOUR RESPONSE HERE

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
*   `import pandas as pd`

*   `import numpy as np`


In [1]:
import pandas as pd

import numpy as np

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [8]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?
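
A notebook cell renders only the value of its last expression automatically, so calling the three inspection methods in one cell shows the `info()` printout and the `describe()` table but not the `head()` rows. The sketch below prints each summary explicitly and adds a per-column null count. It uses a small made-up frame named `toy` (an assumption for illustration, not the TikTok data), so the values are not real results.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the TikTok dataset (values invented).
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": [12, 45, 30, 60],
    "video_view_count": [1000.0, 50.0, np.nan, 250000.0],
})

# Print each summary explicitly so that all of them appear in the output.
print(toy.head(10))
toy.info()              # info() prints its report directly and returns None
print(toy.describe())

# Count missing values per column to help answer the null-value question.
null_counts = toy.isna().sum()
print(null_counts)
```

In the lab itself, the same calls would be made on `data` instead of `toy`.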

In [3]:
data.head(10)
data.info()
data.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


In [10]:
print(data.columns)

Index(['#', 'claim_status', 'video_id', 'video_duration_sec',
       'video_transcription_text', 'verified_status', 'author_ban_status',
       'video_view_count', 'video_like_count', 'video_share_count',
       'video_download_count', 'video_comment_count'],
      dtype='object')


===> ENTER YOUR RESPONSE TO QUESTIONS 1-3 HERE

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [None]:
# What are the different values for claim status and how many of each are in the data?
### YOUR CODE HERE ###
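
One way to answer this is with `Series.value_counts()`. The sketch below is a minimal illustration on a made-up column named `claim_status` (an assumption, not the real data); in the lab the same call would be made on `data["claim_status"]`.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
claim_status = pd.Series(["claim", "opinion", "claim", None, "opinion", "claim"])

# Count each label; dropna=False also reports rows that have no label.
status_counts = claim_status.value_counts(dropna=False)
print(status_counts)
```

Passing `dropna=False` is a design choice here: it makes missing labels visible instead of silently leaving them out of the tally.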


**Question:** What do you notice about the values shown?

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [11]:
# What is the average view count of videos with "claim" status?
# Filter data for videos with 'claim' status
claim_videos = data[data["claim_status"] == "claim"]

# Calculate the average view count
average_view_count = claim_videos["video_view_count"].mean()

# Print result
print(f"Average view count of videos with 'claim' status: {average_view_count}")

Average view count of videos with 'claim' status: 501029.4527477102


In [12]:
# What is the average view count of videos with "opinion" status?

# Filter data for videos with 'opinion' status
opinion_videos = data[data["claim_status"] == "opinion"]

# Calculate the average view count
average_view_count_opinion = opinion_videos["video_view_count"].mean()

# Print result
print(f"Average view count of videos with 'opinion' status: {average_view_count_opinion}")


Average view count of videos with 'opinion' status: 4956.43224989447
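
The two cells above report only the mean, while the task also asks for the median. The following sketch shows one way to get both, first with a Boolean mask and then with a single `groupby()`. The frame `toy` and its values are invented for illustration, so the numbers are not results from the TikTok dataset.

```python
import pandas as pd

# Hypothetical toy data; view counts are invented.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [100.0, 200.0, 900.0, 10.0, 20.0, 60.0],
})

# Boolean mask: keep only rows whose claim_status is "claim".
claim_mask = toy["claim_status"] == "claim"
claim_median = toy.loc[claim_mask, "video_view_count"].median()
print(f"Median view count of 'claim' videos: {claim_median}")

# Equivalent summary for every claim status at once.
view_stats = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
print(view_stats)
```

Comparing mean and median side by side helps show whether a few very large values are pulling the mean upward.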


**Question:** What do you notice about the mean and median within each claim category?

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [13]:
# Get counts for each group combination of claim_status and author_ban_status
group_counts = data.groupby(["claim_status", "author_ban_status"]).size().reset_index(name="count")

# Print result
print(group_counts)

  claim_status author_ban_status  count
0        claim            active   6566
1        claim            banned   1439
2        claim      under review   1603
3      opinion            active   8817
4      opinion            banned    196
5      opinion      under review    463


**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?

***Finding:*** There are 1,439 claim videos from banned authors, far fewer than the 6,566 from active authors and slightly fewer than the 1,603 from authors under review. The more telling comparison is with opinion videos: only 196 opinion videos come from banned authors. Banned authors account for roughly 15% of claim videos (1,439 of 9,608) but only about 2% of opinion videos (196 of 9,476).

Possible reasons for this relationship:

*   **Policy violations:** Banned authors might have repeatedly posted videos that violated community guidelines, and videos that make claims may be more likely to do so than videos that express opinions.
*   **Content moderation:** Claim videos might undergo stricter scrutiny, so authors who post them may be more likely to be reviewed and eventually banned.
*   **Platform enforcement:** The platform might use automated moderation systems that flag authors who post many claim videos, as this could indicate repeated offenses.

Note that `claim_status` records whether a video makes a claim or expresses an opinion (the two classes the planned model will predict). It does not record a copyright claim filed against a video, so this data offers no evidence for copyright enforcement as an explanation.

Overall, claim videos are associated with banned authors far more often than opinion videos are. The data cannot show whether posting claims causes bans, because `author_ban_status` describes the author rather than the individual video.
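
Raw counts are hard to compare because the groups differ in size. As a rough sketch, the share of each ban status within each claim status can be computed from the counts printed in the groupby output above. The column name `share_within_status` is introduced here for illustration only.

```python
import pandas as pd

# Counts copied from the groupby output printed above.
group_counts = pd.DataFrame({
    "claim_status": ["claim"] * 3 + ["opinion"] * 3,
    "author_ban_status": ["active", "banned", "under review"] * 2,
    "count": [6566, 1439, 1603, 8817, 196, 463],
})

# Divide each count by the total number of videos with the same claim status.
totals = group_counts.groupby("claim_status")["count"].transform("sum")
group_counts["share_within_status"] = group_counts["count"] / totals
print(group_counts.round(3))
```

With these counts, banned authors make up about 15% of claim videos and about 2% of opinion videos.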



In [14]:
# Group by 'author_ban_status' and calculate the median of 'video_share_count'
median_share_count = data.groupby('author_ban_status')['video_share_count'].median()

# Display the result
print(median_share_count)


author_ban_status
active            437.0
banned          14468.0
under review     9444.0
Name: video_share_count, dtype: float64


**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

***Findings:*** The median video share count for banned authors (14,468) is significantly higher than that of active authors (437) and higher than that of authors under review (9,444).

Possible explanations:

*   **Controversial content drives engagement:** Banned authors might have posted content that was highly controversial, leading to more shares before their accounts were removed. People tend to share content that sparks strong reactions, whether positive or negative.
*   **Virality before moderation:** Many platforms ban users only after their content gains traction. If a video was widely shared before the ban, it could still contribute to the high median share count.
*   **Algorithmic boost and community reactions:** Videos from banned authors might have been amplified by the platform before being flagged. Some users may deliberately share banned content as an act of defiance, further increasing shares.



In [15]:
# Group by 'author_ban_status' and aggregate required columns
summary_stats = data.groupby('author_ban_status').agg({
    'video_view_count': ['count', 'mean', 'median'],
    'video_like_count': ['count', 'mean', 'median'],
    'video_share_count': ['count', 'mean', 'median']
})

# Display the summary statistics
summary_stats


| author_ban_status | video_view_count (count) | video_view_count (mean) | video_view_count (median) | video_like_count (count) | video_like_count (mean) | video_like_count (median) | video_share_count (count) | video_share_count (mean) | video_share_count (median) |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| active | 15383 | 215927.039524 | 8616.0 | 15383 | 71036.533836 | 2222.0 | 15383 | 14111.466164 | 437.0 |
| banned | 1635 | 445845.439144 | 448201.0 | 1635 | 153017.236697 | 105573.0 | 1635 | 29998.942508 | 14468.0 |
| under review | 2066 | 392204.836399 | 365245.5 | 2066 | 128718.050339 | 71204.5 | 2066 | 25774.696999 | 9444.0 |


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?


**Observation:** Banned authors have significantly higher median views, likes, and shares than active authors:

*   Median video views: 448,201 for banned authors vs. 8,616 for active authors.
*   Median likes: 105,573 for banned authors vs. 2,222 for active authors.
*   Median shares: 14,468 for banned authors vs. 437 for active authors.

Possible explanations:

*   **Controversial content drives engagement:** Banned authors might have created viral or controversial content, leading to more interactions before removal.
*   **Algorithm boost before ban:** Some accounts may have gained significant traction before being flagged and banned.
*   **Community reactions and reshares:** Once an account is banned, users might share its videos more as a reaction or protest.

**Conclusion:** Banned accounts tend to have highly engaging content, likely due to its controversial or viral nature. This could explain why their median engagement metrics are so much higher than those of active users.
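
To put a number on "significantly higher", the medians from the summary table above can be divided directly. This is a small arithmetic sketch; the short column labels `views`, `likes`, and `shares` are shorthand introduced here, not names from the dataset.

```python
import pandas as pd

# Medians copied from the summary table above.
medians = pd.DataFrame(
    {
        "views": [8616.0, 448201.0],
        "likes": [2222.0, 105573.0],
        "shares": [437.0, 14468.0],
    },
    index=["active", "banned"],
)

# How many times larger is the banned-author median than the active-author median?
ratio = medians.loc["banned"] / medians.loc["active"]
print(ratio.round(1))
```

On these figures, the banned-author medians are roughly 52 times (views), 48 times (likes), and 33 times (shares) the active-author medians.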



In [17]:
# Create a likes_per_view column
data["likes_per_view"] = data["video_like_count"] / data["video_view_count"]

# Create a comments_per_view column
data["comments_per_view"] = data["video_comment_count"] / data["video_view_count"]

# Create a shares_per_view column
data["shares_per_view"] = data["video_share_count"] / data["video_view_count"]
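
Because some rows have no view, like, comment, or share counts, the per-view ratios for those rows come out as `NaN`, and `count` in a later aggregation skips them. The sketch below illustrates this on a made-up frame named `toy`; it is an illustration under that assumption, not output from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one row that has missing counts.
toy = pd.DataFrame({
    "video_view_count": [100.0, 200.0, np.nan],
    "video_like_count": [30.0, 50.0, np.nan],
})

# Element-wise division; a missing numerator or denominator yields NaN.
toy["likes_per_view"] = toy["video_like_count"] / toy["video_view_count"]
print(toy)

# count() reports only the non-missing ratios.
n_valid = toy["likes_per_view"].count()
print(f"Rows with a usable ratio: {n_valid}")
```

Note that a view count of zero would produce `inf` or `NaN` rather than an error; the `describe()` output above reports a minimum `video_view_count` of 20, so that case does not arise in this dataset.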



Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [18]:
# Group by claim_status and author_ban_status
grouped_data = data.groupby(["claim_status", "author_ban_status"]).agg({
    "likes_per_view": ["count", "mean", "median"],
    "comments_per_view": ["count", "mean", "median"],
    "shares_per_view": ["count", "mean", "median"]
})

# Display the grouped data
print(grouped_data)



                               likes_per_view                      \
                                        count      mean    median   
claim_status author_ban_status                                      
claim        active                      6566  0.329542  0.326538   
             banned                      1439  0.345071  0.358909   
             under review                1603  0.327997  0.320867   
opinion      active                      8817  0.219744  0.218330   
             banned                       196  0.206868  0.198483   
             under review                 463  0.226394  0.228051   

                               comments_per_view                      \
                                           count      mean    median   
claim_status author_ban_status                                         
claim        active                         6566  0.001393  0.000776   
             banned                         1439  0.001377  0.000746   
             under

**Question:** How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.

***Observations: claim videos vs. opinion videos***

*   **Likes per view:** Claim videos generally have a higher likes-per-view ratio than opinion videos. For example, active claim videos have a mean likes-per-view of 0.33, whereas active opinion videos have only 0.22. This suggests that claim videos might be more engaging or resonate more with viewers.
*   **Comments per view:** Claim videos also receive more comments per view than opinion videos. For instance, active claim videos have a mean comments-per-view of 0.001393, whereas active opinion videos have only 0.000517. This could indicate that claim videos encourage more discussion, possibly due to controversy or audience interest.
*   **Shares per view:** As with likes and comments, claim videos are shared more frequently per view than opinion videos. The mean shares-per-view for claim videos (0.065456 for active authors) is higher than for opinion videos (0.043729 for active authors). This suggests claim videos may be perceived as more valuable or worth sharing.

**Key takeaways:**

*   Claim videos attract more engagement (likes, comments, and shares) than opinion videos.
*   The higher interaction rates suggest that claim videos might be more polarizing, engaging, or emotionally charged, leading to increased participation.
*   Opinion videos, on the other hand, seem to have less viewer interaction, possibly because they are seen as less controversial or engaging.

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.




<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document and those below to craft your response.

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


**Summary for Rosie Mae Bradshaw and the TikTok data team**

1.   **Distribution of claim vs. opinion videos**
     *   Of the 19,084 videos with a claim status, 9,608 (about 50.3%) are claims and 9,476 (about 49.7%) are opinions, so the two classes are nearly balanced. The remaining 298 rows have no claim status.
     *   Although the classes are similar in size, claim videos tend to receive much higher engagement.

2.   **Factors correlating with claim status**
     *   **Author ban status:** A notable number of claim videos come from banned or under-review authors, suggesting that claim content may be more controversial or violate platform policies more frequently. Claim videos from banned authors (1,439) far outnumber opinion videos from banned authors (196).
     *   **Engagement metrics:** Claim videos tend to receive more likes, comments, and shares per view than opinion videos. This suggests that claim videos may be perceived as more engaging, controversial, or thought-provoking, leading to higher interaction.

3.   **Factors correlating with engagement levels**
     *   **Likes per view:** Claim videos have a higher likes-per-view ratio than opinion videos, indicating stronger audience resonance.
     *   **Comments per view:** Claim videos also generate more comments per view, suggesting they encourage discussions, debates, or emotional responses.
     *   **Shares per view:** Claim videos are shared more frequently per view than opinion videos, reinforcing their virality and perceived importance to viewers.

**Key takeaways and recommendations**

*   Claim videos drive higher engagement but also have a stronger association with banned or under-review authors.
*   TikTok may need to monitor claim videos more closely to identify patterns that lead to policy violations while ensuring that engagement remains organic.
*   Further investigation is needed to determine whether high engagement in claim videos is due to their credibility, controversy, or misinformation potential.
*   Opinion videos, though less engaging, make up about half of the labeled content, indicating potential areas to boost interaction through content strategy.
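
The claim and opinion percentages requested in the summary can be checked from the group counts printed earlier in this notebook. The sketch below sums those counts per claim status and converts them to percentages; the variable names are introduced here for illustration. On the full dataframe, `data["claim_status"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Totals per claim status, summed from the claim_status / author_ban_status
# counts printed earlier (active + banned + under review).
status_totals = pd.Series(
    {"claim": 6566 + 1439 + 1603, "opinion": 8817 + 196 + 463}
)

# Convert the totals to percentages of all videos that have a claim status.
percentages = (status_totals / status_totals.sum() * 100).round(1)
print(percentages)
```

On these counts, the split is about 50.3% claims and 49.7% opinions.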

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.