# **TikTok (case-study)**
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/MboGNgnmRJmZcuXjBlohrw_f094b5d0c7e141608de68c7d9c4a28f1_image.png?expiry=1694217600000&hmac=dgVKDHm8C8ASDG0yETbUDYBbTiezQhOiRFXK1dVsC1c"/>

# **Understand the data**

In this stage, I investigate the data provided, build a DataFrame in Python, perform a cursory inspection of the dataset, and report my findings to the TikTok data team.

### Approach
1. Review the data dictionary
2. Download the data and build a dataframe
3. Identify the variables
4. Assess quality of the data using standard data quality dimensions
     - Uniqueness: Records that are Duplicated
     - Consistency: Values Free from Contradiction
     - Relevancy: Data Items with Value Metadata
     - Validity: Data Containing Allowable Values
     - Completeness: Null Values
5. Select the two most important variables for training our model (requested by my team)
6. Share an executive summary with my team <br><br><br>

## 1. Review the data dictionary
<br>
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/VC5JnYsqSD2sETeEaxxIAA_b57e7d4594fb4a939b6648e977786af1_image.png?expiry=1694217600000&amp;hmac=InB1WfjCW3ahMT22Uxx3GPJyi62m3K0Ot_N5_7Ok2TY" alt=""/>

This project uses a dataset called tiktok_dataset.csv. It contains synthetic data created for this project in partnership with TikTok. Examine each data variable gathered.

**19,383 rows**: Each row represents a different published TikTok video in which a claim/opinion has been made.

**12 columns**

| Column name | Type | Description |
|---|---|---|
| # | int | TikTok assigned number for video with claim/opinion. |
| claim_status | obj | Whether the published video has been identified as an “opinion” or a “claim.” In this dataset, an “opinion” refers to an individual’s or group’s personal belief or thought. A “claim” refers to information that is either unsourced or from an unverified source. |
| video_id | int | Random identifying number assigned to video upon publication on TikTok. |
| video_duration_sec | int | How long the published video is measured in seconds. |
| video_transcription_text | obj | Transcribed text of the words spoken in the published video. |
| verified_status | obj | Indicates the status of the TikTok user who published the video in terms of their verification, either “verified” or “not verified.” |
| author_ban_status | obj | Indicates the status of the TikTok user who published the video in terms of their permissions: “active,” “under scrutiny,” or “banned.” |
| video_view_count | float | The total number of times the published video has been viewed. |
| video_like_count | float | The total number of times the published video has been liked by other users. |
| video_share_count | float | The total number of times the published video has been shared by other users. |
| video_download_count | float | The total number of times the published video has been downloaded by other users. |
| video_comment_count | float | The total number of comments on the published video. |

<br><br><br>
## 2. Download the data and build a dataframe
 


In [3]:
# Import libraries and packages
import pandas as pd
import numpy as np

In [27]:
# Load dataset into dataframe, using video_id (the third column) as the index
df = pd.read_csv("tiktok_dataset.csv", index_col="video_id")
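
Before going further, it can help to confirm that the loaded frame actually exposes every column named in the data dictionary. The sketch below is one way to do that; the helper name `missing_columns` and the two-row toy frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Column names copied from the data dictionary in section 1.
EXPECTED_COLUMNS = [
    "#", "claim_status", "video_id", "video_duration_sec",
    "video_transcription_text", "verified_status", "author_ban_status",
    "video_view_count", "video_like_count", "video_share_count",
    "video_download_count", "video_comment_count",
]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return dictionary columns found neither in the columns nor in the index."""
    present = set(frame.columns) | set(frame.index.names)
    return [name for name in expected if name not in present]


# Hypothetical toy frame (only three of the twelve columns) for illustration.
toy = pd.DataFrame(
    {
        "#": [1, 2],
        "claim_status": ["claim", "claim"],
        "video_id": [7017666017, 4014381136],
    }
).set_index("video_id")

print(missing_columns(toy))
```

Calling `missing_columns(df)` on the real frame should return an empty list if the file matches the dictionary; `video_id` counts as present because it is the index.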

<br><br><br>
## 3. Identify the variables

In [28]:
df.head(10)

| video_id | # | claim_status | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7017666017 | 1 | claim | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 4014381136 | 2 | claim | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 9859838091 | 3 | claim | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 1866847991 | 4 | claim | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 7105231098 | 5 | claim | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 8972200955 | 6 | claim | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 4958886992 | 7 | claim | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 2270982263 | 8 | claim | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 5235769692 | 9 | claim | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 4660861094 | 10 | claim | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19382 entries, 7017666017 to 8132759688
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_duration_sec        19382 non-null  int64  
 3   video_transcription_text  19084 non-null  object 
 4   verified_status           19382 non-null  object 
 5   author_ban_status         19382 non-null  object 
 6   video_view_count          19084 non-null  float64
 7   video_like_count          19084 non-null  float64
 8   video_share_count         19084 non-null  float64
 9   video_download_count      19084 non-null  float64
 10  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(2), object(4)
memory usage: 1.8+ MB


In [30]:
df.describe()

| | # | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


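**Structure Issues:**
- All `*_count` variables are stored as `float64`, but counts should be integers. (A float dtype is what pandas uses by default for numeric columns that contain missing values, as these do.)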
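
One possible fix, sketched here under the assumption that the missing values should be kept rather than dropped, is pandas' nullable integer dtype `"Int64"` (capital "I"). The toy frame reuses two rows from the `head()` output plus one made-up row with missing counts.

```python
import numpy as np
import pandas as pd

# Toy frame: two rows from the head() output and one hypothetical row with nulls.
toy = pd.DataFrame(
    {
        "video_view_count": [343296.0, 140877.0, np.nan],
        "video_like_count": [19425.0, 77355.0, np.nan],
    }
)

# Convert every *_count column to the nullable integer dtype.
count_cols = [c for c in toy.columns if c.endswith("_count")]
toy = toy.astype({c: "Int64" for c in count_cols})

print(toy.dtypes)
print(toy)
```

The missing entries survive the conversion as `<NA>`, so no observations have to be discarded just to obtain integer columns.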
<br><br><br>

## 4. Assess quality of the data using standard data quality dimensions

### 4.1 Uniqueness

In [31]:
print("Number of duplicated rows in the data = ", df.duplicated().sum())

Number of duplicated rows in the data =  0
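
Note that `DataFrame.duplicated()` compares column values only, so it says nothing about repeated `video_id` values once that column has become the index. The sketch below shows how the index could be checked separately; the toy frame, with one id repeated on purpose, is an illustrative assumption.

```python
import pandas as pd

# Hypothetical toy frame: the last video_id repeats the first one on purpose.
toy = pd.DataFrame(
    {
        "claim_status": ["claim", "claim", "opinion"],
        "video_duration_sec": [59, 32, 59],
    },
    index=pd.Index([7017666017, 4014381136, 7017666017], name="video_id"),
)

row_duplicates = int(toy.duplicated().sum())       # compares column values only
id_duplicates = int(toy.index.duplicated().sum())  # compares the video_id index

print("Duplicated rows:", row_duplicates)
print("Duplicated video_id values:", id_duplicates)
print("Index is unique:", toy.index.is_unique)
```

On the real frame, `df.index.is_unique` is a quick way to confirm that every video appears only once.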


### 4.2 Consistency

In [32]:
# Print the distinct values of every text (object) column
for column in df.columns:
    if df[column].dtype == "object":
        print("Value Counts of", df[[column]].value_counts())
        print("\n")

Value Counts of claim_status
claim           9608
opinion         9476
dtype: int64


Value Counts of video_transcription_text                                                                              
someone read  in the media that it snows metal and rains acid on venus                                    2
a friend read  in the media a claim that the eiffel tower leans away from the sun                         2
someone read  in the media a claim that the longest recorded cricket match lasted 14 days                 2
i read  in the media a claim that 20% of the world's oxygen is produced in the amazon jungles             2
someone read  in the media a claim that baked beans are not actually baked                                2
                                                                                                         ..
i learned  in a discussion board that saturn is less dense than water                                     1
i learned  in a discussion board that q

**Note:** The counts of each claim status are quite balanced. There are 9,608 claims and 9,476 opinions.
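
Consistency can also be checked against the data dictionary itself. The dictionary lists “under scrutiny” as a ban status, while the rows shown by `head()` contain `under review`, so the two sources disagree on the label. The sketch below compares observed categories with the documented ones; the helper name `unexpected_values` and the toy frame are illustrative assumptions.

```python
import pandas as pd

# Allowed categories as documented in the data dictionary.
ALLOWED = {
    "claim_status": {"claim", "opinion"},
    "verified_status": {"verified", "not verified"},
    "author_ban_status": {"active", "under scrutiny", "banned"},
}


def unexpected_values(frame: pd.DataFrame, allowed=ALLOWED) -> dict:
    """Map each column to the observed values that the dictionary does not list."""
    report = {}
    for column, valid in allowed.items():
        observed = set(frame[column].dropna().unique())
        extra = observed - valid
        if extra:
            report[column] = sorted(extra)
    return report


# Hypothetical toy frame shaped like the first rows of the dataset.
toy = pd.DataFrame(
    {
        "claim_status": ["claim", "claim", None],
        "verified_status": ["not verified", "verified", "not verified"],
        "author_ban_status": ["under review", "active", "active"],
    }
)

print(unexpected_values(toy))
```

Null values are ignored here on purpose, since they are covered by the completeness check.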


### 4.3 Relevancy

In [33]:
df.head()

| video_id | # | claim_status | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7017666017 | 1 | claim | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 4014381136 | 2 | claim | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 9859838091 | 3 | claim | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 1866847991 | 4 | claim | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 7105231098 | 5 | claim | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |


**Relevancy Issues**:
- The `#` column is only a running number (1 to 19,382) and is not useful for our problem <br><br><br>
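
A minimal sketch of how the `#` column could be dropped, assuming it really is nothing more than a 1-based row counter; the toy frame is an illustrative assumption.

```python
import pandas as pd

# Hypothetical toy frame with the "#" counter and one real feature.
toy = pd.DataFrame(
    {"#": [1, 2], "video_duration_sec": [59, 32]},
    index=pd.Index([7017666017, 4014381136], name="video_id"),
)

# Drop "#" only if it is exactly the sequence 1..n, i.e. carries no information.
is_row_counter = toy["#"].tolist() == list(range(1, len(toy) + 1))
if is_row_counter:
    toy = toy.drop(columns="#")

print(toy.columns.tolist())
```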

### 4.4 Validity

In [34]:
for column in df.columns:
    # is_integer_dtype matches int64 on every platform, unlike a comparison with 'int'
    if pd.api.types.is_integer_dtype(df[column]):
        print("Value Counts of", df[[column]].value_counts(sort=False))
        print("\n")

Value Counts of #    
1        1
2        1
3        1
4        1
5        1
        ..
19378    1
19379    1
19380    1
19381    1
19382    1
Length: 19382, dtype: int64


Value Counts of video_duration_sec
5                     337
6                     378
7                     359
8                     373
9                     319
10                    365
11                    340
12                    345
13                    352
14                    357
15                    367
16                    370
17                    332
18                    300
19                    345
20                    380
21                    321
22                    358
23                    344
24                    352
25                    326
26                    369
27                    331
28                    337
29                    348
30                    324
31                    355
32                    335
33                    322
34                    370
35          

In [35]:
for column in df.columns:
    if pd.api.types.is_float_dtype(df[column]):
        print("Value Counts of", df[[column]].value_counts(sort=False))
        print("\n")

Value Counts of video_view_count
20.0                2
22.0                1
23.0                1
24.0                1
27.0                1
                   ..
999446.0            1
999653.0            1
999655.0            1
999673.0            1
999817.0            1
Length: 15632, dtype: int64


Value Counts of video_like_count
0.0                  4
1.0                 16
2.0                 16
3.0                 16
4.0                 17
                    ..
649695.0             1
653561.0             1
654588.0             1
656243.0             1
657830.0             1
Length: 12224, dtype: int64


Value Counts of video_share_count
0.0                   99
1.0                  144
2.0                  147
3.0                  122
4.0                   95
                    ... 
238004.0               1
240154.0               1
241010.0               1
249672.0               1
256130.0               1
Length: 9231, dtype: int64


Value Counts of video_download_count
0.0 
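
Scanning value counts by eye gets unwieldy for columns with thousands of distinct values, so the same validity questions can be phrased as explicit rules. The rules below (non-negative whole-number counts, positive durations, likes not exceeding views) are assumptions about what "allowable" means here, and the helper name `validity_report` and the toy frame are illustrative.

```python
import numpy as np
import pandas as pd


def validity_report(frame: pd.DataFrame) -> dict:
    """Count the values that break a few assumed validity rules."""
    count_cols = [c for c in frame.columns if c.endswith("_count")]
    counts = frame[count_cols]
    # Nulls are excluded here; they belong to the completeness check.
    fractional = ((counts % 1) != 0) & counts.notna()
    return {
        "negative_counts": int((counts < 0).sum().sum()),
        "fractional_counts": int(fractional.sum().sum()),
        "non_positive_durations": int((frame["video_duration_sec"] <= 0).sum()),
        "likes_exceeding_views": int(
            (frame["video_like_count"] > frame["video_view_count"]).sum()
        ),
    }


# Toy frame: two rows from head(), one row with nulls, one made-up invalid row.
toy = pd.DataFrame(
    {
        "video_duration_sec": [59, 32, 31, 10],
        "video_view_count": [343296.0, 140877.0, np.nan, 100.0],
        "video_like_count": [19425.0, 77355.0, np.nan, 250.0],
    }
)

report = validity_report(toy)
print(report)
```

Only the last toy row is flagged, because it has more likes than views.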

### 4.5 Completeness

In [41]:
df.isnull().sum()

#                             0
claim_status                298
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

**Completeness Issues**:
- 298 observations contain null values. Every variable except `#`, `video_duration_sec`, `verified_status`, and `author_ban_status` is affected, and each affected variable has exactly 298 nulls.
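
Identical null counts per column suggest, but do not prove, that the same rows are missing everywhere. The sketch below shows one way to check whether the nulls co-occur row by row; the toy frame and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: rows 1 and 3 are missing in every affected column.
toy = pd.DataFrame(
    {
        "claim_status": ["claim", None, "opinion", None],
        "video_view_count": [343296.0, np.nan, 4958.0, np.nan],
        "video_like_count": [19425.0, np.nan, 820.0, np.nan],
        "verified_status": ["not verified"] * 4,
    }
)

# Columns that contain at least one null.
null_cols = toy.columns[toy.isna().any()].tolist()

# Number of null cells per row, restricted to those columns.
nulls_per_row = toy[null_cols].isna().sum(axis=1)

rows_all_missing = int((nulls_per_row == len(null_cols)).sum())
rows_partly_missing = int(
    ((nulls_per_row > 0) & (nulls_per_row < len(null_cols))).sum()
)

print("Rows missing every affected column:", rows_all_missing)
print("Rows missing only some of them:", rows_partly_missing)
```

If `rows_partly_missing` is 0 on the real data, the incomplete observations form a single block of rows.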

<br><br><br>
## 5. Select the two most important variables for training our model

In [80]:
df.groupby(["author_ban_status", "claim_status"]).size()

author_ban_status  claim_status
active             claim           6566
                   opinion         8817
banned             claim           1439
                   opinion          196
under review       claim           1603
                   opinion          463
dtype: int64

**Note:** For active authors, the claim and opinion counts are of similar size (6,566 vs. 8,817). For banned (1,439 vs. 196) and under-review (1,603 vs. 463) authors, claims far outnumber opinions.
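
The imbalance is easier to compare as a share of claims within each ban status. The sketch below rebuilds the table from the counts printed above, so that it runs on its own, and divides the claims by the row totals.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.Series(
    [6566, 8817, 1439, 196, 1603, 463],
    index=pd.MultiIndex.from_product(
        [["active", "banned", "under review"], ["claim", "opinion"]],
        names=["author_ban_status", "claim_status"],
    ),
)

# One row per ban status, one column per claim status.
table = counts.unstack("claim_status")

# Share of claims among all labelled videos of each ban status.
claim_share = (table["claim"] / table.sum(axis=1)).round(3)
print(claim_share)
```

Roughly 43% of the videos from active authors are claims, compared with about 88% for banned authors and about 78% for authors under review. On the full frame, `pd.crosstab(df["author_ban_status"], df["claim_status"], normalize="index")` should give the same proportions directly.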


In [72]:
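# Aggregate only the numeric columns (text columns cannot be averaged) and skip the "#" counter
numeric_cols = df.select_dtypes("number").columns.drop("#")
df.groupby(["author_ban_status", "claim_status"])[list(numeric_cols)].agg(["median", "mean"])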

| author_ban_status | claim_status | video_duration_sec (median) | video_duration_sec (mean) | video_view_count (median) | video_view_count (mean) | video_like_count (median) | video_like_count (mean) | video_share_count (median) | video_share_count (mean) | video_download_count (median) | video_download_count (mean) | video_comment_count (median) | video_comment_count (mean) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| active | claim | 33.0 | 32.633567 | 499500.5 | 499221.733171 | 121943.0 | 164960.302924 | 17774.5 | 32769.101889 | 1124.5 | 2048.66951 | 286.0 | 687.827901 |
| active | opinion | 33.0 | 32.489395 | 4958.0 | 4958.120563 | 820.0 | 1091.714982 | 121.0 | 217.16695 | 7.0 | 13.665986 | 1.0 | 2.696609 |
| banned | claim | 32.0 | 32.370396 | 512572.0 | 505907.917304 | 132044.0 | 173719.102849 | 19018.0 | 34056.580959 | 1204.0 | 2141.457957 | 296.0 | 698.402363 |
| banned | opinion | 32.0 | 31.979592 | 5083.5 | 4876.530612 | 799.5 | 1027.515306 | 108.5 | 208.423469 | 6.0 | 12.938776 | 1.0 | 2.311224 |
| under review | claim | 32.0 | 31.990643 | 500774.0 | 504054.640674 | 125882.0 | 165566.95446 | 18084.0 | 33155.623206 | 1153.0 | 2098.931379 | 279.0 | 698.336245 |
| under review | opinion | 28.0 | 30.053996 | 4884.0 | 4958.105832 | 876.0 | 1139.663067 | 124.0 | 220.431965 | 8.0 | 14.205184 | 1.0 | 2.87689 |


In [99]:
# Create a likes_per_view column
df["likes_per_view"] = df["video_like_count"] / df["video_view_count"]

# Create a comments_per_view column
df["comments_per_view"] = df["video_comment_count"] / df["video_view_count"]

# Create a shares_per_view column
df["shares_per_view"] = df["video_share_count"] / df["video_view_count"]

In [100]:
df.groupby(["author_ban_status", "claim_status"])[["likes_per_view", "comments_per_view", "shares_per_view"]].agg(["median", "mean"])

| author_ban_status | claim_status | likes_per_view (median) | likes_per_view (mean) | comments_per_view (median) | comments_per_view (mean) | shares_per_view (median) | shares_per_view (mean) |
|---|---|---|---|---|---|---|---|
| active | claim | 0.326538 | 0.329542 | 0.000776 | 0.001393 | 0.049279 | 0.065456 |
| active | opinion | 0.21833 | 0.219744 | 0.000252 | 0.000517 | 0.032405 | 0.043729 |
| banned | claim | 0.358909 | 0.345071 | 0.000746 | 0.001377 | 0.051606 | 0.067893 |
| banned | opinion | 0.198483 | 0.206868 | 0.000193 | 0.000434 | 0.030728 | 0.040531 |
| under review | claim | 0.320867 | 0.327997 | 0.000789 | 0.001367 | 0.049967 | 0.065733 |
| under review | opinion | 0.228051 | 0.226394 | 0.000293 | 0.000536 | 0.035027 | 0.044472 |


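After investigating the variables, I find that `author_ban_status` and the engagement rates (likes, comments, and shares per view) are the most important variables for training our classification model. Claims come disproportionately from banned and under-review authors, and claim videos show higher engagement rates than opinion videos in every ban-status group.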