## TikTok Metadata Cleaning Notebook

### Setup

In [1]:
import pandas as pd

### Load Data

In [2]:
# Load the post metadata from the JSON files

df_Biden = pd.read_json('data/AIBiden.json')
df_Biden.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 44 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AIGCDescription        288 non-null    object 
 1   BAInfo                 288 non-null    object 
 2   adAuthorization        288 non-null    bool   
 3   adLabelVersion         288 non-null    int64  
 4   aigcLabelType          288 non-null    int64  
 5   author                 288 non-null    object 
 6   authorStats            288 non-null    object 
 7   challenges             288 non-null    object 
 8   collected              288 non-null    bool   
 9   contents               286 non-null    object 
 10  createTime             288 non-null    int64  
 11  desc                   288 non-null    object 
 12  digged                 288 non-null    bool   
 13  diversificationId      214 non-null    float64
 14  duetDisplay            288 non-null    int64  
 15  duetEn

In [3]:
df_Trump = pd.read_json('data/AITrump.json')
df_Trump.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 45 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   AIGCDescription        489 non-null    object 
 1   BAInfo                 489 non-null    object 
 2   adAuthorization        489 non-null    bool   
 3   adLabelVersion         489 non-null    int64  
 4   aigcLabelType          489 non-null    int64  
 5   author                 489 non-null    object 
 6   authorStats            489 non-null    object 
 7   challenges             489 non-null    object 
 8   collected              489 non-null    bool   
 9   contents               488 non-null    object 
 10  createTime             489 non-null    int64  
 11  desc                   489 non-null    object 
 12  digged                 489 non-null    bool   
 13  duetDisplay            489 non-null    int64  
 14  duetEnabled            489 non-null    bool   
 15  duetIn

In [4]:
# Combine the two dataframes 
df = pd.concat([df_Biden, df_Trump], ignore_index=True)
len(df)

777
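
The two `info()` outputs above report 44 and 45 columns, so the frames do not share an identical schema. As a minimal sketch with invented toy frames (the `toy_*` names and all values are illustrative, not taken from the dataset), `pd.concat` keeps the union of the columns and fills the gaps with `NaN`:

```python
import pandas as pd

# Toy stand-ins for df_Biden and df_Trump; all values are invented.
toy_a = pd.DataFrame({"desc": ["a1", "a2"], "diversificationId": [10001.0, 10002.0]})
toy_b = pd.DataFrame({"desc": ["b1"], "poi": [{"name": "somewhere"}]})

combined = pd.concat([toy_a, toy_b], ignore_index=True)

# Columns present in only one of the two inputs
only_a = set(toy_a.columns) - set(toy_b.columns)
only_b = set(toy_b.columns) - set(toy_a.columns)
print(only_a, only_b)

# Number of missing cells per column after the concat
print(combined.isna().sum())
```

Running the same set difference on `df_Biden.columns` and `df_Trump.columns` would show which fields appear in only one file before any of them are used downstream.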

### Field Descriptions
* AIGCDescription: This is likely an object containing information related to AI-generated content.
* BAInfo: This object may contain information about the business or advertiser associated with the content.
* adAuthorization: A boolean value indicating whether the content is authorized as an advertisement.
* adLabelVersion: An integer representing the version of the ad label used for the content.
* aigcLabelType: An integer indicating the type of AI-generated content label used.
* author: An object containing information about the author or creator of the content.
* authorStats: An object with statistics related to the author.
* challenges: An object containing information about any challenges or contests associated with the content.
* collected: A boolean value indicating whether the content has been collected or saved by the user.
* contents: An object containing the actual content, such as text, images, or videos.
* createTime: An integer (int64) Unix timestamp, in seconds, for when the content was created; it is converted to a datetime later in this notebook.
* desc: An object containing a description or summary of the content.
* digged: A boolean value indicating whether the user has "digged" or liked the content.
* diversificationId: A float value, possibly used for content diversification or recommendation purposes.
* duetDisplay: An integer value related to the display settings for duet (split-screen) videos.
* duetEnabled: A boolean value indicating whether duet videos are enabled for the content.
* duetInfo: An object containing information about duet videos, if applicable.
* forFriend: A boolean value indicating whether the content is intended for a specific friend or group.
* id: An integer value representing the unique identifier for the content.
* isAd: A boolean value indicating whether the content is an advertisement.
* itemCommentStatus: An integer value representing the comment status for the content.
* itemMute: A boolean value indicating whether the content is muted.
* music: An object containing information about the music or audio associated with the content.
* officalItem: A boolean value indicating whether the content is an official item or from an official source.
* originalItem: A boolean value indicating whether the content is an original item or a repost/share.
* playlistId: An object representing the playlist ID, if the content is part of a playlist.
* privateItem: A boolean value indicating whether the content is private or public.
* secret: A boolean value, possibly indicating whether the content is secret or confidential.
* shareEnabled: A boolean value indicating whether sharing is enabled for the content.
* showNotPass: A boolean value, potentially related to content moderation or filtering.
* stats: An object containing statistical information about the content, such as views, likes, or shares.
* statsV2: Another object containing statistical information, possibly a newer or different version.
* stitchDisplay: An integer value related to the display settings for stitch (combined) videos.
* stitchEnabled: A boolean value indicating whether stitch videos are enabled for the content.
* textExtra: An object containing additional text-related information or metadata.
* video: An object containing information about the video file, if the content is a video.
* vl1: A boolean value, possibly indicating whether the content is verified or from a trusted source.
* stickersOnItem: An object containing information about stickers or overlays applied to the content.
* imagePost: An object containing information about the image, if the content is an image post.
* anchors: An object with information about anchors or locations associated with the content.
* poi: An object containing points of interest (POI) related to the content.
* effectStickers: An object with information about effect stickers or filters applied to the content.
* item_control: An object containing control settings or permissions for the content.
* warnInfo: An object with warning or advisory information related to the content.
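
Several of the fields above (`author`, `authorStats`, `statsV2`, `video`) are nested objects rather than scalar values. As a rough sketch, assuming a single invented record with only a few of those keys, `pd.json_normalize` can flatten such objects into dotted column names for inspection:

```python
import pandas as pd

# One invented record shaped like the nested fields described above.
records = [
    {
        "desc": "example caption",
        "author": {"id": "111", "uniqueId": "example_user"},
        "statsV2": {"playCount": 1200, "shareCount": 15},
    }
]

# json_normalize expands nested dicts into columns such as 'author.uniqueId'.
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

This is only one way to look inside the nested columns; the notebook itself extracts individual keys with `apply` instead.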

### Build new dataframe

In [5]:
# Build a slimmed-down dataframe with only the fields needed for analysis.
# Note: 'id' is the author's id; the video's own id is stored in 'videoId'.

df2 = pd.DataFrame(columns=['id', 'uniqueId', 'signature', 'followerCount', 'aigcLabelType', 'createTime', 'desc', 'playCount', 'shareCount', 'videoId', 'link'])
df2['id'] = df.author.apply(lambda x: x['id'])
df2['uniqueId'] = df.author.apply(lambda x: x['uniqueId'])
df2['signature'] = df.author.apply(lambda x: x['signature'])
df2['followerCount'] = df.authorStats.apply(lambda x: x['followerCount'])
# AI-labeled content: 1 -> True, 0 -> False (any other value becomes NaN under this mapping)
df2['aigcLabelType'] = df.aigcLabelType.map({1: True, 0: False})
# createTime is a Unix timestamp in seconds
df2['createTime'] = pd.to_datetime(df.createTime, unit='s')
df2['desc'] = df.desc
df2['playCount'] = df.statsV2.apply(lambda x: x['playCount'])
df2['shareCount'] = df.statsV2.apply(lambda x: x['shareCount'])
df2['videoId'] = df.video.apply(lambda x: x['id'])
# Cast videoId to str so the concatenation works even if the ids are numeric
df2['link'] = 'https://www.tiktok.com/@' + df2['uniqueId'] + '/video/' + df2['videoId'].astype(str)

len(df2)


777
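
The lambdas in the cell above index directly into each nested dict, so a record that lacks a key, or holds a non-dict value, would raise an error. A minimal sketch of a more tolerant accessor, assuming a hypothetical helper named `get_nested` that is not part of the original notebook:

```python
def get_nested(record, key, default=None):
    """Return record[key] if record is a dict containing key, else default."""
    if isinstance(record, dict):
        return record.get(key, default)
    return default

# Invented rows: a complete dict, a dict without the key, and a missing value.
rows = [{"uniqueId": "example_user"}, {}, None]
values = [get_nested(r, "uniqueId") for r in rows]
print(values)  # ['example_user', None, None]
```

In the notebook this could be used as `df.author.apply(lambda x: get_nested(x, 'uniqueId'))`, at the cost of silently producing missing values that then need to be checked.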

### Remove Duplicates

In [7]:
# Remove duplicate rows based on 'videoId', keeping the first occurrence

df2.drop_duplicates(subset='videoId', keep='first', inplace=True)
print(len(df2))

# Save the cleaned dataframe
df2.to_csv('data/tiktok.csv', index=False)
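
Before dropping rows, it can help to look at which records are duplicated. A small sketch on an invented toy frame (the values are illustrative, not from the dataset), using `duplicated` with `keep=False` to list every member of each duplicate group:

```python
import pandas as pd

# Invented rows; the second and third share a videoId.
toy = pd.DataFrame({"videoId": ["1", "2", "2"], "desc": ["a", "b", "b again"]})

# keep=False marks every row in a duplicate group, not only the later ones.
dupes = toy[toy.duplicated(subset="videoId", keep=False)]
print(len(dupes))

# Same call as in the notebook, but returning a new frame instead of mutating.
deduped = toy.drop_duplicates(subset="videoId", keep="first")
print(len(deduped))
```

Comparing the row count before and after the drop gives the number of removed duplicates.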