# ***TicTok Project - Differentiating Opinions vs Claims***
<br/>

**The Project Goal** User-generated reports are currently reviewed and moderated manually, which has produced a large backlog. To reduce it, we aim to develop a machine learning model that classifies videos submitted to the platform as containing claims or opinions. This is expected to shrink the backlog and help prioritize reported content more efficiently.

**The Dataset Source** Google Analytics Course - Coursera



## Stage 1 - Data Cleaning and Analysis
**The Purpose** This Jupyter Notebook inspects the provided data to understand its structure and contents.

**Part 1** - Understanding the situation and goal.

**Part 2** - Understanding the data.

**Part 3** - Understanding the variables.

### Rough notes
1. Include a summary of the column data types, non-null counts, relevant and irrelevant columns, and anything else code-related that is worth showing in the notebook. Select a couple of variables to focus on and include their minimum and maximum values.

In [2]:
# importing necessary libraries
import pandas as pd 
import numpy as np 

In [5]:
# Reading the csv file
data_main = pd.read_csv("tiktok_dataset.csv")

### Part 2 - Understanding the data and inspecting the dataset

In [13]:
data_main.head(10)

| | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | claim | 7017666017 | 59 | someone shared with me that drone deliveries a... | not verified | under review | 343296.0 | 19425.0 | 241.0 | 1.0 | 0.0 |
| 1 | 2 | claim | 4014381136 | 32 | someone shared with me that there are more mic... | not verified | active | 140877.0 | 77355.0 | 19034.0 | 1161.0 | 684.0 |
| 2 | 3 | claim | 9859838091 | 31 | someone shared with me that american industria... | not verified | active | 902185.0 | 97690.0 | 2858.0 | 833.0 | 329.0 |
| 3 | 4 | claim | 1866847991 | 25 | someone shared with me that the metro of st. p... | not verified | active | 437506.0 | 239954.0 | 34812.0 | 1234.0 | 584.0 |
| 4 | 5 | claim | 7105231098 | 19 | someone shared with me that the number of busi... | not verified | active | 56167.0 | 34987.0 | 4110.0 | 547.0 | 152.0 |
| 5 | 6 | claim | 8972200955 | 35 | someone shared with me that gross domestic pro... | not verified | under review | 336647.0 | 175546.0 | 62303.0 | 4293.0 | 1857.0 |
| 6 | 7 | claim | 4958886992 | 16 | someone shared with me that elvis presley has ... | not verified | active | 750345.0 | 486192.0 | 193911.0 | 8616.0 | 5446.0 |
| 7 | 8 | claim | 2270982263 | 41 | someone shared with me that the best selling s... | not verified | active | 547532.0 | 1072.0 | 50.0 | 22.0 | 11.0 |
| 8 | 9 | claim | 5235769692 | 50 | someone shared with me that about half of the ... | not verified | active | 24819.0 | 10160.0 | 1050.0 | 53.0 | 27.0 |
| 9 | 10 | claim | 4660861094 | 45 | someone shared with me that it would take a 50... | verified | active | 931587.0 | 171051.0 | 67739.0 | 4104.0 | 2540.0 |


In [9]:
print(f"Shape of dataset is {data_main.shape}")

Shape of dataset is (19382, 12)


This dataset contains 19,382 rows and 12 columns. The column names and their descriptions are as follows:
1. "#" - A number assigned to each video by TicTok.
2. "claim_status" - Whether a video is identified as a claim or an opinion.
3. "video_id" - Random identifying number assigned to a video when published on TicTok.
4. "video_duration_sec" - The length of the video in seconds.
5. "video_transcription_text" - Transcribed text of the words spoken in the video.
6. "verified_status" - Whether the author of the video is a verified user on TicTok.
7. "author_ban_status" - The permission status of the account of the author who published the video.
8. "video_view_count" - Number of views obtained on the video.
9. "video_like_count" - Number of likes obtained on the video.
10. "video_share_count" - Number of times the video was shared.
11. "video_download_count" - Number of times the video was downloaded.
12. "video_comment_count" - Number of comments obtained on the video.

In [17]:
data_main.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [18]:
data_main.isna().sum()

#                             0
claim_status                298
video_id                      0
video_duration_sec            0
video_transcription_text    298
verified_status               0
author_ban_status             0
video_view_count            298
video_like_count            298
video_share_count           298
video_download_count        298
video_comment_count         298
dtype: int64

This dataset contains nulls in the following columns: claim_status, video_transcription_text, video_view_count, video_like_count, video_share_count, video_download_count and video_comment_count.
<br/>

The dataset columns are a combination of numeric and categorical data types. Notably, every column that contains nulls has the same number of them (298), which suggests that the same rows may be incomplete across all seven columns.
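
The data types, non-null counts and null counts above can also be collected into a single table, together with the minimum and maximum of the numeric columns. The following is a minimal sketch on a small made-up frame named `toy` (a stand-in for `data_main`; its values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for data_main; the values are made up for illustration.
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": [59, 32, 31, 25],
    "video_view_count": [343296.0, 140877.0, None, 437506.0],
})

# One row per column: dtype, non-null count and null count.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "nulls": toy.isna().sum(),
})

# Minimum and maximum are only computed for the numeric columns;
# non-numeric columns are left as NaN after index alignment.
numeric = toy.select_dtypes(include="number")
summary["min"] = numeric.min()
summary["max"] = numeric.max()

print(summary)
```

Replacing `toy` with `data_main` should give the same kind of overview for all 12 columns.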

**Question** What do the rows with null values represent? Are the videos duplicated? Are the authors duplicated? Are the videos deleted?
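
One way to start answering this is to check whether the nulls always fall in the same rows, and whether any video IDs repeat among those rows. The sketch below uses a small made-up frame named `toy` in place of `data_main`; the data is invented and only illustrates the checks:

```python
import pandas as pd

# Toy stand-in for data_main; the values are made up for illustration.
toy = pd.DataFrame({
    "video_id": [1, 2, 3, 4, 5],
    "claim_status": ["claim", None, "opinion", None, "claim"],
    "video_view_count": [100.0, None, 20.0, None, 300.0],
    "video_like_count": [10.0, None, 2.0, None, 30.0],
})

# Columns that contain at least one null.
null_cols = toy.columns[toy.isna().any()].tolist()

# Rows where every one of those columns is null at the same time.
all_null = toy[null_cols].isna().all(axis=1)
# Rows where at least one of those columns is null.
any_null = toy[null_cols].isna().any(axis=1)

# If both counts match, the nulls always occur together in the same rows.
print(all_null.sum(), any_null.sum())

# Number of repeated video IDs among the incomplete rows.
print(toy.loc[any_null, "video_id"].duplicated().sum())
```

If the two counts also match on the real data, the 298 nulls per column belong to the same 298 rows, which would point to records missing as a whole rather than scattered gaps.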

In [19]:
data_main.describe()

| | # | video_id | video_duration_sec | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|
| count | 19382.0 | 19382.0 | 19382.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 | 19084.0 |
| mean | 9691.5 | 5627454000.0 | 32.421732 | 254708.558688 | 84304.63603 | 16735.248323 | 1049.429627 | 349.312146 |
| std | 5595.245794 | 2536440000.0 | 16.229967 | 322893.280814 | 133420.546814 | 32036.17435 | 2004.299894 | 799.638865 |
| min | 1.0 | 1234959000.0 | 5.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4846.25 | 3430417000.0 | 18.0 | 4942.5 | 810.75 | 115.0 | 7.0 | 1.0 |
| 50% | 9691.5 | 5618664000.0 | 32.0 | 9954.5 | 3403.5 | 717.0 | 46.0 | 9.0 |
| 75% | 14536.75 | 7843960000.0 | 47.0 | 504327.0 | 125020.0 | 18222.0 | 1156.25 | 292.0 |
| max | 19382.0 | 9999873000.0 | 60.0 | 999817.0 | 657830.0 | 256130.0 | 14994.0 | 9599.0 |


1. Columns # and video_id are stored as numbers but are identifiers rather than measurements, so their summary statistics are not meaningful.

2. Videos on TicTok in this dataset range from 5 to 60 seconds long, with an average length of 32.4 seconds. The mean and the median (32 seconds) are nearly equal, so this column looks roughly symmetric. The evenly spaced quartiles (18, 32 and 47 seconds) hint at a fairly even spread across the range rather than a bell shape, which a histogram would confirm.

3. View counts range from 20 to 999,817, with a mean of 254,708.56 and a median of 9,954.5 (50th percentile). The data is right-skewed, with most values lying below the mean: the mean is more than 25 times the median, which suggests that videos with very high view counts pull the average up while many videos have far fewer views. The jump from the median (9,954.5 views) to the 75th percentile (504,327 views) points the same way: half of the videos have fewer than 10,000 views, while the top quarter have more than 500,000.

4. The other engagement columns (likes, shares, downloads and comments) show the same pattern, with means far above their medians, so they are right-skewed as well.

5. This suggests that outliers, or distinct groups of videos with very different engagement levels, may be present in the data.
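
The skew and outlier observations above can be checked numerically. The sketch below computes the sample skewness and applies the common 1.5 × IQR rule to a made-up series of view counts (`views` is invented for illustration, and the 1.5 × IQR rule is an assumed choice of outlier definition, not one stated in the dataset documentation):

```python
import pandas as pd

# Made-up view counts for illustration: many small values, a few very large ones.
views = pd.Series(
    [20, 500, 900, 4000, 5000, 9000, 10000, 480000, 510000, 999000],
    dtype="float64",
)

# Positive skewness indicates a longer right tail.
skewness = views.skew()

# Interquartile range and the upper fence of the 1.5 * IQR rule.
q1, q3 = views.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the upper fence are flagged as potential outliers.
outliers = views[views > upper_fence]

print(f"skewness: {skewness:.2f}")
print(f"upper fence: {upper_fence}")
print(outliers)
```

Applied to the quartiles reported by `describe()` for video_view_count (25% = 4,942.5 and 75% = 504,327), the upper fence is 504,327 + 1.5 × 499,384.5 = 1,253,403.75, which is above the maximum of 999,817. Under this particular rule no view count would be flagged, so the skew in that column may come from distinct groups of videos rather than from isolated extreme values.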

### Claim Statuses - Opinions or Claims

In [27]:
# Claim status : Opinion or Claim
data_main.groupby('claim_status').count()

| claim_status | # | video_id | video_duration_sec | video_transcription_text | verified_status | author_ban_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| claim | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 | 9608 |
| opinion | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 | 9476 |


There are 9,608 videos labeled as claims and 9,476 labeled as opinions. This leaves 298 rows with no label, as noted above, and means there are slightly more videos labeled as claims than as opinions.
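
The label counts, including the unlabeled rows, can also be read from a single `value_counts` call. This is a minimal sketch on a made-up frame named `toy` (invented data standing in for `data_main`):

```python
import pandas as pd

# Toy stand-in for data_main; the values are made up for illustration.
toy = pd.DataFrame(
    {"claim_status": ["claim", "opinion", "claim", None, "claim", None]}
)

# dropna=False keeps the unlabeled rows as their own category.
counts = toy["claim_status"].value_counts(dropna=False)

# normalize=True gives each label's share of the labeled rows only.
shares = toy["claim_status"].value_counts(normalize=True)

print(counts)
print(shares)
```

With the counts reported above, the labeled videos split into about 50.3% claims (9,608 of 19,084) and 49.7% opinions, so the two classes are close to balanced.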

In [34]:
mask_claim = data_main[data_main['claim_status'] == 'claim']
mask_opinion = data_main[data_main['claim_status'] == 'opinion']

In [36]:
print(f"The average view count of videos with claim as status is {mask_claim['video_view_count'].mean()}.")
print(f"The average view count of videos with opinion as status is {mask_opinion['video_view_count'].mean()}.")

The average view count of videos with claim as status is 501029.4527477102.
The average view count of videos with opinion as status is 4956.43224989447.


In [41]:
print(f"The median view of videos with claim as status is {mask_claim['video_view_count'].median()}.")
print(f"The median view of videos with opinion as status is {mask_opinion['video_view_count'].median()}.")

The median view of videos with claim as status is 501555.0.
The median view of videos with opinion as status is 4953.0.


The average view count for videos labeled as claims is 501,029.45, against 4,956.43 for videos labeled as opinions, a difference of 496,073.02 views, even though there are only 132 more claim videos than opinion videos.

This suggests that videos with "claim" status tend to attract significantly more attention or engagement than those labeled as "opinion." However, it is worth exploring why "claim" videos might get more views: factors such as content type, topic relevance, or viewer interest could contribute to this disparity.

The median view count for videos labeled as claims is 501,555 and for videos labeled as opinions it is 4,953. These values are very close to their respective means.

Thus, taken individually, both distributions are likely to be roughly symmetric.
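
The per-label mean and median can be computed in one step with `groupby(...).agg(...)` instead of filtering each label separately. A minimal sketch, using a made-up frame named `toy` in place of `data_main` (values invented for illustration):

```python
import pandas as pd

# Toy stand-in for data_main; the values are made up for illustration.
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim",
                     "opinion", "opinion", "opinion", None],
    "video_view_count": [400000.0, 500000.0, 600000.0,
                         4000.0, 5000.0, 6000.0, None],
})

# Rows with a missing claim_status are excluded from the groups by default.
stats = (
    toy.groupby("claim_status")["video_view_count"]
    .agg(["count", "mean", "median"])
)

print(stats)
```

Having the mean and median side by side per label makes it easy to see whether each group is roughly symmetric on its own.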



### Statuses of Authors - Active, Banned and Under review

In [46]:
# Ban status of Author
data_main.groupby('author_ban_status').count()

| author_ban_status | # | claim_status | video_id | video_duration_sec | video_transcription_text | verified_status | video_view_count | video_like_count | video_share_count | video_download_count | video_comment_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| active | 15663 | 15383 | 15663 | 15663 | 15383 | 15663 | 15383 | 15383 | 15383 | 15383 | 15383 |
| banned | 1639 | 1635 | 1639 | 1639 | 1635 | 1639 | 1635 | 1635 | 1635 | 1635 | 1635 |
| under review | 2080 | 2066 | 2080 | 2080 | 2066 | 2080 | 2066 | 2066 | 2066 | 2066 | 2066 |


Out of 19,382 rows, 15,663 belong to active author accounts, 1,639 to banned accounts and 2,080 to accounts under review.

In [49]:
mask_active = data_main[data_main['author_ban_status'] == 'active']
mask_ban = data_main[data_main['author_ban_status'] == 'banned']
mask_review = data_main[data_main['author_ban_status'] == 'under review']

In [50]:
print(f"The average view count of videos with authors having active account is {mask_active['video_view_count'].mean()}.")
print(f"The average view count of videos with authors having banned account is {mask_ban['video_view_count'].mean()}.")
print(f"The average view count of videos with authors having account that are under review is {mask_review['video_view_count'].mean()}.")

The average view count of videos with authors having active account is 215927.03952415002.
The average view count of videos with authors having banned account is 445845.4391437309.
The average view count of videos with authors having account that are under review is 392204.8363988383.


In [82]:
print(f"The median view of videos with authors having active account is {mask_active['video_view_count'].median()}.")
print(f"The median view of videos with authors having banned account is {mask_ban['video_view_count'].median()}.")
print(f"The median view of videos with authors having account that are under review is {mask_review['video_view_count'].median()}.")

The median view of videos with authors having active account is 8616.0.
The median view of videos with authors having banned account is 448201.0.
The median view of videos with authors having account that are under review is 365245.5.


The average view counts for active accounts, banned accounts and accounts under review are 215,927.04, 445,845.44 and 392,204.84 respectively. Their median view counts are 8,616, 448,201 and 365,245.5. These values suggest that videos from banned authors receive the most views, followed by videos from authors under review, with videos from active authors last.

The large gap between the mean (215,927.04) and the median (8,616) for active accounts indicates right skew: a mean far above the median suggests that a subset of videos with very high view counts is inflating the average. In contrast, the data for banned accounts and accounts under review seems more consistent, with their medians relatively close to their means.
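
A quick way to compare skew across the three account statuses is the ratio of mean to median: a ratio near 1 suggests a roughly symmetric group, while a much larger ratio suggests right skew. The sketch below uses a made-up frame named `toy` (invented values standing in for `data_main`), and the `mean_to_median` column is a helper introduced here for illustration:

```python
import pandas as pd

# Toy stand-in for data_main; the values are made up for illustration.
toy = pd.DataFrame({
    "author_ban_status": ["active"] * 4 + ["banned"] * 3,
    "video_view_count": [1000.0, 2000.0, 3000.0, 794000.0,
                         400000.0, 450000.0, 500000.0],
})

by_status = (
    toy.groupby("author_ban_status")["video_view_count"]
    .agg(["mean", "median"])
)

# Ratio near 1: mean and median agree. Ratio far above 1: right skew.
by_status["mean_to_median"] = by_status["mean"] / by_status["median"]

print(by_status)
```

With the figures printed above, the ratio is about 25 for active accounts, but close to 1 for banned accounts (about 0.99) and accounts under review (about 1.07).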

In [91]:
data_main.groupby(['claim_status', 'author_ban_status'])['video_view_count'].mean(numeric_only=True)

claim_status  author_ban_status
claim         active               499221.733171
              banned               505907.917304
              under review         504054.640674
opinion       active                 4958.120563
              banned                 4876.530612
              under review           4958.105832
Name: video_view_count, dtype: float64

In [90]:
data_main.groupby(['claim_status', 'author_ban_status'])['video_id'].count()

claim_status  author_ban_status
claim         active               6566
              banned               1439
              under review         1603
opinion       active               8817
              banned                196
              under review          463
Name: video_id, dtype: int64

Of the 1,639 videos posted by banned authors, 1,439 are labeled as claims and only 196 as opinions; the remaining 4 have no label.
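
The same counts can be turned into shares per account status with `pd.crosstab`, which makes the statuses comparable despite their different sizes. A minimal sketch on a made-up frame named `toy` (invented data standing in for `data_main`):

```python
import pandas as pd

# Toy stand-in for data_main; the values are made up for illustration.
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "active",
                          "banned", "banned", "banned", "banned", "banned"],
    "claim_status": ["claim", "opinion", "opinion",
                     "claim", "claim", "claim", "opinion", None],
})

# Raw counts per (ban status, claim status); unlabeled rows are dropped.
counts = pd.crosstab(toy["author_ban_status"], toy["claim_status"])

# normalize="index" converts each row into shares that sum to 1.
shares = pd.crosstab(
    toy["author_ban_status"], toy["claim_status"], normalize="index"
)

print(counts)
print(shares)
```

Using the counts from the output above, claims make up about 42.7% of labeled videos from active authors (6,566 of 15,383), 88.0% from banned authors (1,439 of 1,635) and 77.6% from authors under review (1,603 of 2,066).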