# **TikTok Project**
**Course 2 - Get Started with Python**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>

To complete the activity, follow the instructions and answer the questions below. Then, you will use your responses to these questions and the questions included in the Course 2 PACE Strategy Document to create an executive summary.

Be sure to complete this activity before moving on to Course 3. You can assess your work by comparing the results to a completed exemplar after completing the end-of-course project.

# **Identify data types and compile summary information**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

# **PACE stages**

<img src="images/Pace.png" width="100" height="100" align=left>

   *        [Plan](#scrollTo=psz51YkZVwtN&line=3&uniqifier=1)
   *        [Analyze](#scrollTo=mA7Mz_SnI8km&line=4&uniqifier=1)
   *        [Construct](#scrollTo=Lca9c8XON8lc&line=2&uniqifier=1)
   *        [Execute](#scrollTo=401PgchTPr4E&line=2&uniqifier=1)

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:



### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

==> ENTER YOUR RESPONSE HERE

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
*   `import pandas as pd`

*   `import numpy as np`


In [2]:
# Import packages
import numpy as np
import pandas as pd

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [3]:
# Load dataset into dataframe
data = pd.read_csv("Files/tiktok_dataset.csv")

In [4]:
data.iloc[1,4]

'someone shared with me that there are more microorganisms in one teaspoon of soil than people on the planet'

In [5]:
data["author_ban_status"].value_counts()

author_ban_status
active          15663
under review     2080
banned           1639
Name: count, dtype: int64

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?
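
One way to answer the null-value part of Question 2 directly is to count missing values per column. The sketch below is only an illustration: `sample` is a hypothetical miniature frame that borrows three column names from the dataset, and its values are made up. On the real dataframe the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics part of the TikTok data structure
sample = pd.DataFrame({
    "claim_status": ["claim", "opinion", np.nan, "claim"],
    "video_duration_sec": [59, 32, 21, 10],
    "video_view_count": [343296.0, 4958.0, np.nan, 140877.0],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count rows that contain at least one missing value
rows_with_missing = int(sample.isna().any(axis=1).sum())
print(rows_with_missing)
```

If the missing values always fall in the same rows, the row count matches the per-column count, which is a quick hint that whole records are incomplete rather than scattered cells.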

















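**1.** Each row represents one reported video, labeled as a claim or an opinion.

**2.** There are 8 numeric variables and 4 string variables.
The float columns hold whole-number counts, so they could be stored as integers once the missing values are handled.
Every variable except `#`, `video_id`, `video_duration_sec`, `verified_status`, and `author_ban_status` has exactly 298 null values (NaN).
The `#` variable is unnecessary because pandas already provides its own index.

**3.** See the notes after the `data.describe()` output.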
  

In [6]:
# Display the dataframe (pandas shows a truncated preview of the first and last rows)
data


Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
...,...,...,...,...,...,...,...,...,...,...,...,...
19377,19378,,7578226840,21,,not verified,active,,,,,
19378,19379,,6079236179,53,,not verified,active,,,,,
19379,19380,,2565539685,10,,verified,under review,,,,,
19380,19381,,2969178540,24,,not verified,active,,,,,


In [7]:
# Get summary info
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [36]:
data.info(show_counts=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 15 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   #                         int64  
 1   claim_status              object 
 2   video_id                  int64  
 3   video_duration_sec        int64  
 4   video_transcription_text  object 
 5   verified_status           object 
 6   author_ban_status         object 
 7   video_view_count          float64
 8   video_like_count          float64
 9   video_share_count         float64
 10  video_download_count      float64
 11  video_comment_count       float64
 12  likes_per_view            float64
 13  comments_per_view         float64
 14  shares_per_view           float64
dtypes: float64(8), int64(3), object(4)
memory usage: 2.2+ MB


In [9]:
# Inspect the per-claim-status sums; info() prints its report and returns None, so nothing is assigned
data.groupby(["claim_status"]).agg("sum").info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, claim to opinion
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         2 non-null      int64  
 1   video_id                  2 non-null      int64  
 2   video_duration_sec        2 non-null      int64  
 3   video_transcription_text  2 non-null      object 
 4   verified_status           2 non-null      object 
 5   author_ban_status         2 non-null      object 
 6   video_view_count          2 non-null      float64
 7   video_like_count          2 non-null      float64
 8   video_share_count         2 non-null      float64
 9   video_download_count      2 non-null      float64
 10  video_comment_count       2 non-null      float64
dtypes: float64(5), int64(3), object(3)
memory usage: 192.0+ bytes


In [10]:
# Get summary statistics
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


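**`video_view_count` and `video_like_count` span a very wide range between their minimum and maximum values, and their means lie far above their medians, which suggests strong right skew and possible outliers.**

**`video_id` is an identifier, so its summary statistics are not meaningful here.**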

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [11]:
# What are the different values for claim status and how many of each are in the data?
data["claim_status"].value_counts(normalize=True)


claim_status
claim      0.503458
opinion    0.496542
Name: proportion, dtype: float64

In [12]:
# Rough check of the claim share using a rounded claim count (the exact count is 9608)
9600/(9600+9476)

0.5032501572656741

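**Claims and opinions are split almost evenly: about 50.3% claims and 49.7% opinions among the rows that have a claim status.**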

**Question:** What do you notice about the values shown?
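
The proportions above ignore rows without a claim status. As a minimal sketch, the counts reported elsewhere in this notebook (9,608 claims, 9,476 opinions, 19,382 rows in total) can be used to compare both views. The `counts` series is built by hand here; on the real dataframe, `data["claim_status"].value_counts(normalize=True, dropna=False)` should give the first breakdown directly.

```python
import pandas as pd

# Counts taken from the notebook output: 19,382 rows, of which 9,608 are claims and 9,476 are opinions
counts = pd.Series({"claim": 9608, "opinion": 9476, "missing": 19382 - 9608 - 9476})

# Share of every row, including rows with no claim status
share_of_all_rows = (counts / counts.sum() * 100).round(1)
print(share_of_all_rows)

# Share among labeled rows only
labeled = counts.drop("missing")
share_of_labeled = (labeled / labeled.sum() * 100).round(1)
print(share_of_labeled)
```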

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [13]:
# What is the average view count of videos with "claim" status?

mask_claim = (data["claim_status"] == "claim")
data_claim = data[mask_claim]
data_claim["video_view_count"].describe().round()[["min","max","mean","50%"]]



min       1049.0
max     999817.0
mean    501029.0
50%     501555.0
Name: video_view_count, dtype: float64

In [14]:
# What is the average view count of videos with "opinion" status?
mask_opinion = (data["claim_status"] == "opinion")
data_opinion = data[mask_opinion]
data_opinion["video_view_count"].describe().round()[["min","max","mean","50%"]]

min       20.0
max     9998.0
mean    4956.0
50%     4953.0
Name: video_view_count, dtype: float64


**The mean and median view counts of claim videos are about 100 times higher than those of opinion videos. Inference: videos containing claims reach far more viewers than opinion videos.**

**Question:** What do you notice about the mean and median within each claim category?
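
The two Boolean masks can also be replaced by a single `groupby()` call that returns both statistics side by side. This is a sketch on a hypothetical `sample` frame with a handful of illustrative values, not the notebook's own solution; on the real data, `data` would take the place of `sample`.

```python
import pandas as pd

# Hypothetical sample rows; the view counts are illustrative only
sample = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [343296.0, 140877.0, 902185.0, 4958.0, 20.0, 9998.0],
})

# Mean and median view count for each claim status in one table
view_stats = sample.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
print(view_stats)

# How many times larger the claim median is than the opinion median
median_ratio = view_stats.loc["claim", "median"] / view_stats.loc["opinion", "median"]
print(round(median_ratio, 1))
```

Comparing the mean with the median inside each group is a quick check for skew: if they are close, as in the notebook's output for both categories, extreme values are not dominating that group.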

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [15]:
# Get counts for each group combination of claim status and author ban status
data_group = data.groupby(["claim_status","author_ban_status"]).count()[["#"]]
data_group, len(data_group)

(                                   #
 claim_status author_ban_status      
 claim        active             6566
              banned             1439
              under review       1603
 opinion      active             8817
              banned              196
              under review        463,
 6)

*Videos with claim status have about 7x as many banned authors and about 3.5x as many authors under review as videos with opinion status, but only about 0.7x as many active authors.*

  **Inference: claim videos are associated with more author bans than opinion videos.**

  


**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?
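
To put numbers on this comparison, the group counts from the `groupby()` output above can be turned into ratios and within-group shares. The sketch below copies those six counts into a small hand-built frame; the variable names are illustrative.

```python
import pandas as pd

# Group counts copied from the groupby output above
counts = pd.DataFrame(
    {"claim": [6566, 1439, 1603], "opinion": [8817, 196, 463]},
    index=["active", "banned", "under review"],
)

# Claim-to-opinion ratio of video counts for each author ban status
claim_to_opinion = (counts["claim"] / counts["opinion"]).round(2)
print(claim_to_opinion)

# Share of each ban status within each claim status, in percent
share_within_status = (counts / counts.sum() * 100).round(1)
print(share_within_status)
```

The shares are often easier to interpret than raw ratios because the two claim categories do not have exactly the same number of videos.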

Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [16]:
data[:3]

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0


In [18]:
# What's the median video share count of each author ban status?
data.groupby("author_ban_status").median(numeric_only=True)[["video_share_count"]]

author_ban_status,video_share_count
active,437.0
banned,14468.0
under review,9444.0


In [19]:
data.groupby(["author_ban_status"])[["video_view_count"]].agg(["count","mean","median"])

author_ban_status,video_view_count (count),video_view_count (mean),video_view_count (median)
active,15383,215927.039524,8616.0
banned,1635,445845.439144,448201.0
under review,2066,392204.836399,365245.5


**The median video share count of banned authors is about 33 times higher than that of active authors.**

**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, the argument for the `agg()` function is a dictionary whose keys are columns. The values for each column are a list of the calculations you want to perform.
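
As a small illustration of the dictionary form of `agg()`, the sketch below uses a hypothetical `sample` frame with made-up values. It also shows one possible way to flatten the two-level column labels that this form produces; the flattening step is an optional convenience and not part of the task.

```python
import pandas as pd

# Hypothetical sample rows; values are illustrative only
sample = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "video_view_count": [100.0, 300.0, 1000.0, 3000.0],
    "video_like_count": [10.0, 50.0, 400.0, 600.0],
})

# Keys are column names; values are lists of the aggregations to compute
summary = sample.groupby("author_ban_status").agg({
    "video_view_count": ["count", "mean", "median"],
    "video_like_count": ["count", "mean", "median"],
})

# The result has two-level column labels such as ("video_view_count", "mean");
# joining the levels gives flat names that are easier to reference
summary.columns = ["_".join(pair) for pair in summary.columns]
print(summary)
```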

In [20]:
functs = ["count","mean","median","min","max"]

author_status_group = data.groupby(["claim_status","author_ban_status"]).agg({
  "video_view_count":functs,
  "video_like_count":functs,
  "video_share_count":functs,
  "video_download_count":functs,
  "video_comment_count":functs
})
pd.set_option('display.max_columns', None)
author_status_group

claim_status,author_ban_status,video_view_count (count),video_view_count (mean),video_view_count (median),video_view_count (min),video_view_count (max),video_like_count (count),video_like_count (mean),video_like_count (median),video_like_count (min),video_like_count (max),video_share_count (count),video_share_count (mean),video_share_count (median),video_share_count (min),video_share_count (max),video_download_count (count),video_download_count (mean),video_download_count (median),video_download_count (min),video_download_count (max),video_comment_count (count),video_comment_count (mean),video_comment_count (median),video_comment_count (min),video_comment_count (max)
claim,active,6566,499221.733171,499500.5,1207.0,999817.0,6566,164960.302924,121943.0,0.0,657830.0,6566,32769.101889,17774.5,0.0,256130.0,6566,2048.66951,1124.5,0.0,14308.0,6566,687.827901,286.0,0.0,8674.0
claim,banned,1439,505907.917304,512572.0,2557.0,997703.0,1439,173719.102849,132044.0,134.0,653561.0,1439,34056.580959,19018.0,12.0,249672.0,1439,2141.457957,1204.0,0.0,14994.0,1439,698.402363,296.0,0.0,8470.0
claim,under review,1603,504054.640674,500774.0,1049.0,999655.0,1603,165566.95446,125882.0,320.0,647236.0,1603,33155.623206,18084.0,14.0,238004.0,1603,2098.931379,1153.0,0.0,14417.0,1603,698.336245,279.0,0.0,9599.0
opinion,active,8817,4958.120563,4958.0,20.0,9998.0,8817,1091.714982,820.0,0.0,4375.0,8817,217.16695,121.0,0.0,1674.0,8817,13.665986,7.0,0.0,101.0,8817,2.696609,1.0,0.0,32.0
opinion,banned,196,4876.530612,5083.5,72.0,9916.0,196,1027.515306,799.5,1.0,4083.0,196,208.423469,108.5,0.0,1269.0,196,12.938776,6.0,0.0,97.0,196,2.311224,1.0,0.0,28.0
opinion,under review,463,4958.105832,4884.0,75.0,9964.0,463,1139.663067,876.0,0.0,4276.0,463,220.431965,124.0,0.0,1204.0,463,14.205184,8.0,0.0,89.0,463,2.87689,1.0,0.0,26.0


In [21]:
# Keep only the means so the detail table for the executive summary (es) is not overloaded
functions_es = ["mean"]

es_correlate_thesis = data.groupby(["author_ban_status"]).agg({
  "video_view_count":functions_es,
  "video_like_count":functions_es,
  "video_share_count":functions_es,
  "video_download_count":functions_es,
  "video_comment_count":functions_es
})

#rename columns for clarity

es_correlate_thesis = es_correlate_thesis.rename(columns = {
  "video_view_count": "views",
  "video_like_count": "likes",
  "video_share_count": "shares",
  "video_download_count": "downloads",
  "video_comment_count": "comments"
})

es_correlate_thesis = es_correlate_thesis.astype("int") 

pd.set_option('display.max_columns', None)

es_correlate_thesis.drop(["under review"], axis=0, inplace=True)

es_correlate_thesis.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, active to banned
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   (views, mean)      2 non-null      int32
 1   (likes, mean)      2 non-null      int32
 2   (shares, mean)     2 non-null      int32
 3   (downloads, mean)  2 non-null      int32
 4   (comments, mean)   2 non-null      int32
dtypes: int32(5)
memory usage: 56.0+ bytes


In [22]:
es_correlate_thesis.columns, type(es_correlate_thesis.columns), len(es_correlate_thesis.iloc[0])

(MultiIndex([(    'views', 'mean'),
             (    'likes', 'mean'),
             (   'shares', 'mean'),
             ('downloads', 'mean'),
             ( 'comments', 'mean')],
            ),
 pandas.core.indexes.multi.MultiIndex,
 5)

In [23]:
""" test list comprehension"""

l = [[1,2,3,4,5],[6,7,8,9,10]]
newl = [ (l[0][i] / l[1][i]) for i in range(len(l[0]))]
newl

[0.16666666666666666, 0.2857142857142857, 0.375, 0.4444444444444444, 0.5]

In [24]:
es_correlate_thesis.loc["-------------------"] = ["----","----","----","----","----"]
# banned/active ratio for each metric: mean for banned authors divided by mean for active authors
es_correlate_thesis.loc["banned/active ratio"] = [ (es_correlate_thesis.iloc[1,i] / es_correlate_thesis.iloc[0,i]) for i in range(len(es_correlate_thesis.iloc[0])) ]
es_correlate_thesis.loc["banned/active ratio"] = es_correlate_thesis.loc["banned/active ratio"].astype("float").round(2).astype("str") +"x"

es_correlate_thesis

author_ban_status,views (mean),likes (mean),shares (mean),downloads (mean),comments (mean)
active,215927,71036,14111,882,295
banned,445845,153017,29998,1886,614
-------------------,----,----,----,----,----
banned/active ratio,2.06x,2.15x,2.13x,2.14x,2.08x


In [25]:
round(2.111, 2)

2.11

In [26]:
type(es_correlate_thesis.style)

pandas.io.formats.style.Styler

In [27]:
"x" in "100x" and type("x")== str

True

In [28]:
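def all_bold(x, color):
  # Styler.apply passes one column at a time; style the cell that holds the banned/active ratio
  query = x["banned/active ratio"]
  return [f"font-weight: bold; color: {color}" if query == y else "" for y in x]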
    
es_correlate_thesis.style.apply(all_bold, color = "red")

author_ban_status,views (mean),likes (mean),shares (mean),downloads (mean),comments (mean)
active,215927,71036,14111,882,295
banned,445845,153017,29998,1886,614
-------------------,----,----,----,----,----
banned/active ratio,2.06x,2.15x,2.13x,2.14x,2.08x


**Videos by banned authors average roughly twice as many views, likes, shares, downloads, and comments as videos by active authors.**
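
The banned/active ratios can also be computed without an index-based loop, because pandas aligns two rows on their column names when dividing. The sketch below rebuilds the two mean rows from the table above in a hand-made frame called `means` (an illustrative name) and keeps the formatted labels separate so the numeric table stays numeric.

```python
import pandas as pd

# Mean engagement per author ban status, copied from the table above
means = pd.DataFrame(
    {
        "views": [215927, 445845],
        "likes": [71036, 153017],
        "shares": [14111, 29998],
        "downloads": [882, 1886],
        "comments": [295, 614],
    },
    index=["active", "banned"],
)

# Row-wise division aligns on column names, so no explicit loop is needed
ratio = (means.loc["banned"] / means.loc["active"]).round(2)
print(ratio)

# Keep the formatted labels in a separate object so `means` keeps its integer dtype
ratio_labels = ratio.map(lambda value: f"{value:.2f}x")
print(ratio_labels)
```

Appending string rows such as separators or `"2.06x"` to the numeric table converts its columns to the `object` dtype, which makes later calculations on that table harder. Storing the labels separately avoids that.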

In [29]:
data[:3]

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?

Now, create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [30]:
#  Create a likes_per_view column
data["likes_per_view"] = data["video_like_count"] / data["video_view_count"]
# Create a comments_per_view column
data["comments_per_view"] = data["video_comment_count"] / data["video_view_count"]

# Create a shares_per_view column
data["shares_per_view"] = data["video_share_count"] / data["video_view_count"]

In [31]:
data[:2]

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count,likes_per_view,comments_per_view,shares_per_view
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0,0.056584,0.0,0.000702
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0,0.549096,0.004855,0.135111


In [32]:
data["video_duration_sec"].describe()

count    19382.000000
mean        32.421732
std         16.229967
min          5.000000
25%         18.000000
50%         32.000000
75%         47.000000
max         60.000000
Name: video_duration_sec, dtype: float64

Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [33]:
per_view_group = data.groupby(["claim_status","author_ban_status"]).agg({
  "likes_per_view":["count","mean","median"],
  "comments_per_view":["count","mean","median"],
  "shares_per_view":["count","mean","median"]
})
per_view_group

claim_status,author_ban_status,likes_per_view (count),likes_per_view (mean),likes_per_view (median),comments_per_view (count),comments_per_view (mean),comments_per_view (median),shares_per_view (count),shares_per_view (mean),shares_per_view (median)
claim,active,6566,0.329542,0.326538,6566,0.001393,0.000776,6566,0.065456,0.049279
claim,banned,1439,0.345071,0.358909,1439,0.001377,0.000746,1439,0.067893,0.051606
claim,under review,1603,0.327997,0.320867,1603,0.001367,0.000789,1603,0.065733,0.049967
opinion,active,8817,0.219744,0.21833,8817,0.000517,0.000252,8817,0.043729,0.032405
opinion,banned,196,0.206868,0.198483,196,0.000434,0.000193,196,0.040531,0.030728
opinion,under review,463,0.226394,0.228051,463,0.000536,0.000293,463,0.044472,0.035027


In [34]:
download_group = data.groupby(["claim_status","author_ban_status"])[["video_download_count"]].agg(["count","mean","median","min","max"])
download_group

claim_status,author_ban_status,video_download_count (count),video_download_count (mean),video_download_count (median),video_download_count (min),video_download_count (max)
claim,active,6566,2048.66951,1124.5,0.0,14308.0
claim,banned,1439,2141.457957,1204.0,0.0,14994.0
claim,under review,1603,2098.931379,1153.0,0.0,14417.0
opinion,active,8817,13.665986,7.0,0.0,101.0
opinion,banned,196,12.938776,6.0,0.0,97.0
opinion,under review,463,14.205184,8.0,0.0,89.0


In [35]:
percentage_data = data.groupby(["claim_status"]).agg({"#":"count"})
percentage_data = percentage_data.rename(columns= {"#":"count"})
percentage_data

claim_status,count
claim,9608
opinion,9476


**Claim videos were downloaded more than 100 times as often as opinion videos on average.**

**Claims have roughly 1.5 times as many likes and shares per view and about 2.5 to 3 times as many comments per view as opinions. There is no notable difference between active and banned authors.**
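
The size of the claim/opinion gap in the per-view rates can be checked with a short calculation. As a sketch, the block below copies the mean per-view rates for active authors from the grouped table above into a hand-built frame and divides the claim row by the opinion row; the other ban statuses could be compared the same way.

```python
import pandas as pd

# Mean per-view rates for active authors, copied from the grouped table above
per_view_means = pd.DataFrame(
    {
        "likes_per_view": [0.329542, 0.219744],
        "comments_per_view": [0.001393, 0.000517],
        "shares_per_view": [0.065456, 0.043729],
    },
    index=["claim", "opinion"],
)

# Claim rate divided by opinion rate for each engagement metric
claim_vs_opinion = (per_view_means.loc["claim"] / per_view_means.loc["opinion"]).round(2)
print(claim_vs_opinion)
```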

**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.




<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document and those below to craft your response.


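  **There are 8 numeric variables and 4 string variables.**

  **The float columns hold whole-number counts, so they could be stored as integers once the missing values are handled.**

  **Every variable except `#`, `video_id`, `video_duration_sec`, `verified_status`, and `author_ban_status` has exactly 298 null values (NaN).**
  
  **The `#` variable is unnecessary because pandas already provides its own index.**

  

**`video_view_count` and `video_like_count` span a very wide range between their minimum and maximum values, and their means lie far above their medians, which suggests strong right skew and possible outliers.**

**`video_id` is an identifier, so its summary statistics are not meaningful here.**

**Claims and opinions are split almost evenly: about 50.3% claims and 49.7% opinions among the rows that have a claim status.**

**The mean and median view counts of claim videos are about 100 times higher than those of opinion videos. Inference: videos containing claims reach far more viewers than opinion videos.**

*Videos with claim status have about 7x as many banned authors and about 3.5x as many authors under review as videos with opinion status, but only about 0.7x as many active authors.*

  **Inference: claim videos are associated with more author bans than opinion videos.**

**The median video share count of banned authors is about 33 times higher than that of active authors.**

**Videos by banned authors average roughly twice as many views, likes, shares, downloads, and comments as videos by active authors.**

**Claim videos were downloaded more than 100 times as often as opinion videos on average.**

**Claims have roughly 1.5 times as many likes and shares per view and about 2.5 to 3 times as many comments per view as opinions. There is no notable difference between active and banned authors.**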

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


==>
The TikTok dataset contains 19,382 reports, each with several characteristics. Of these, 49.6% are labeled as claims, 48.9% as opinions, and 1.5% have no claim status.
Examining the dataset leads to the following inferences:
Reports with the status "claim" receive significantly more views, likes, comments, shares, and downloads than reports with the status "opinion".
Additionally, videos published by banned authors have roughly twice the average engagement counts of videos by active authors, while the engagement rates per view differ little between the two groups.
Claim reports are also associated with more author bans than opinion reports.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.