## Identifying TikTok posts that could be news

Author: Jasmine Khuu

This notebook finds the posts that contain hashtags from our [list of news-related hashtags](https://docs.google.com/spreadsheets/d/1X6mvIKTQo-AQfER5OFJg7kPVphx4nM3rM67-NnCtGt0/edit#gid=0).


In [5]:
import pandas as pd

hashtags_df = pd.read_csv('News-related hashtags - 2.csv', dtype=str)
results_df = pd.read_csv('results_all.csv', dtype=str)

# extract hashtags from dataframe
hashtags_list = hashtags_df['Hashtags'].tolist()

# extract hashtags from video description
def extract_hashtags(description):
    hashtags = []
    for word in description.split():
        if word.startswith('#'):
            hashtags.append(word[1:])
    return hashtags

# create new dataframe (explicit copy, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning)
posts_df = results_df[['video_id', 'video_description']].copy()
posts_df['video_description'] = posts_df['video_description'].astype(str)
posts_df['hashtags'] = posts_df['video_description'].apply(extract_hashtags)

# build the lookup set once instead of on every comparison
hashtags_set = set(hashtags_list)

# new column: hashtags from each post that also appear in the hashtag list
posts_df['found_hashtags'] = posts_df['hashtags'].apply(
    lambda tags: [tag for tag in tags if tag in hashtags_set]
)

# filter out rows where no hashtags are found
filtered_posts_df = posts_df[posts_df['found_hashtags'].apply(lambda x: len(x) > 0)]

# save to csv only if there are found hashtags
if not filtered_posts_df.empty:
    filtered_posts_df[['video_id', 'found_hashtags', 'video_description']].to_csv('newsposts_data.csv', index=False)

filtered_posts_df


| | video_id | video_description | hashtags | found_hashtags |
|---|---|---|---|---|
| 10 | 7293175043465497889 | things ex staff said about kpop idols pt 3 #kp... | [kpop, kpopfyp, kpopviral, viral, fyp, kpopfac... | [kpop] |
| 21 | 7300414739799084321 | video funny & cute #funnyvideos #meme #cute #... | [funnyvideos, meme, cute, fyp, viral, humor, usa] | [usa] |
| 73 | 7297053362237852970 | Number 1 is surprising 😳🙈🧐 #usa #tiktok #histo... | [usa, tiktok, history, fastfood, map, trend, p... | [usa] |
| 270 | 7311421071389887776 | Creepy facts about south korea 🇰🇷 #seoullife #... | [seoullife, korea, southkorea, seoul, koreatra... | [kdrama] |
| 272 | 7305849562822872352 | I was on the train at 2 am 🫠🫠 #seoullife #kore... | [seoullife, korea, southkorea, seoul, koreatra... | [kdrama] |
| ... | ... | ... | ... | ... |
| 9297 | 7.30595E+18 | #nyt #nytconnections #newyorktimes | [nyt, nytconnections, newyorktimes] | [nyt] |
| 9302 | 7.3152E+18 | Bro did NOT pronounce it correctly💀💀💀#fortnite... | [school, baseball, basketball, videoviral, tre... | [football] |
| 9313 | 7.30217E+18 | Do you know why the international date line lo... | [reallifelore, lifelore, education, geography,... | [education] |
| 9322 | 7.29588E+18 | Most hated vs most loved member in kpop groups... | [kpop, kpopfyp, viral, kpopfications, trending] | [kpop] |
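
Row 9302 in the output above hints at a limitation of splitting on whitespace: its description contains `💀💀💀#fortnite` with no space before the `#`, and `fortnite` does not lead the extracted list. The sketch below shows one possible alternative; it is not the notebook's method. It assumes a hashtag is a `#` followed by word characters, and `extract_hashtags_regex` is a hypothetical name introduced here for illustration.

```python
import re

# Assumption: a hashtag is '#' followed by one or more word characters
# (letters, digits, underscore; Unicode letters such as 'シ' also count).
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags_regex(description):
    """Return hashtags without the leading '#', even when glued to other text."""
    # str() guards against non-string values such as missing descriptions.
    return HASHTAG_PATTERN.findall(str(description))

print(extract_hashtags_regex("pronounce it correctly💀💀💀#fortnite #school"))
```

Swapping this in for `extract_hashtags` would likely change which posts match, so any saved CSV would need to be regenerated.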


In [3]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
posts_df = posts_df.copy()

# Build the lookup set once rather than on every comparison
hashtags_set = set(hashtags_list)

# Store, for each row, the hashtags that also appear in the hashtags list
posts_df['found_hashtags'] = posts_df['hashtags'].apply(
    lambda tags: [tag for tag in tags if tag in hashtags_set]
)

posts_df[['video_id','found_hashtags','video_description']].to_csv('newsposts_data.csv', index=False)

# Display the modified DataFrame
posts_df


| | video_id | video_description | hashtags | found_hashtags |
|---|---|---|---|---|
| 0 | 7.315229e+18 | Replying to @ffredinho_23 WIFE’S PERSPECTIVE -... | [reddit, redditstories, redditreadings, askred... | [] |
| 1 | 7.315223e+18 | Replying to @connormalloy5 Update - My WIFE ha... | [reddit, redditstories, redditreadings, askred... | [] |
| 2 | 7.315212e+18 | My WIFE has been INSISTING that we have a 3 WA... | [reddit, redditstories, redditreadings, askred... | [] |
| 3 | 7.315089e+18 | 🤷🏼‍♀️🤷🏼‍♀️ #dateideas #redthoughts #viral #love | [dateideas, redthoughts, viral, love] | [] |
| 4 | 7.306701e+18 | For y'all who always want to see more. We've g... | [greatdane, doggrooming, grossola, fypシ, satis... | [] |
| ... | ... | ... | ... | ... |
| 9325 | 7.300560e+18 | | [] | [] |
| 9326 | 7.307380e+18 | Stay out of the woods #scaryvideos #skinwalke... | [scaryvideos, skinwalker, scarystories, backro... | [] |
| 9327 | 7.294470e+18 | -KPOP IDOLS WHO WERE FORCED TO DO SOMETHING BY... | [fyp, kpop, yuna, momo, momoland, bahiyyih, so... | [kpop] |
| 9328 | 7.303360e+18 | | [] | [] |
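
Most rows in this output have an empty `found_hashtags` list, so a natural next step is to count how many posts matched and which hashtags drove the matches. The sketch below is one way to do that in plain Python. `summarize_matches` is a hypothetical helper and the toy column holds made-up values, not results from the dataset.

```python
from collections import Counter

def summarize_matches(found_hashtags_column):
    """Count posts with at least one match and tally each matched hashtag."""
    rows = list(found_hashtags_column)
    # A post counts as matched when its list of found hashtags is non-empty.
    matched_posts = sum(1 for tags in rows if tags)
    # Tally how often each hashtag from the list was found across all posts.
    tag_counts = Counter(tag for tags in rows for tag in tags)
    return matched_posts, len(rows), tag_counts

# Toy stand-in for posts_df['found_hashtags'] (illustrative values only)
toy_column = [["kpop"], [], ["usa"], [], ["kpop", "education"]]
matched, total, counts = summarize_matches(toy_column)
print(f"{matched} of {total} posts matched")
print(counts.most_common())
```

On the real data the same helper could be called as `summarize_matches(posts_df['found_hashtags'])`, since iterating a pandas Series yields its values. The per-hashtag tally may also help spot broad tags, such as `usa` on a meme video, that match posts unlikely to be news.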


In [4]:
import pandas as pd

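# Read the CSV files (note: without dtype=str, video_id is parsed as a float
# and loses precision, as the scientific-notation IDs in the output show)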
hashtags_df = pd.read_csv('News-related hashtags - 2.csv')
results_df = pd.read_csv('results_all.csv')

# Extract hashtags from the hashtags dataframe
hashtags_list = hashtags_df['Hashtags'].tolist()

# Extract hashtags from video descriptions
def extract_hashtags(description):
    hashtags = []
    for word in description.split():
        if word.startswith('#'):
            hashtags.append(word[1:])
    return hashtags

# Create a dataframe for posts (explicit copy to avoid SettingWithCopyWarning)
posts_df = results_df[['video_id', 'video_description']].copy()
posts_df['video_description'] = posts_df['video_description'].astype(str)

# Apply the function to create a new column 'hashtags'
posts_df['hashtags'] = posts_df['video_description'].apply(extract_hashtags)

# Build the lookup set once rather than on every comparison
hashtags_set = set(hashtags_list)

# Store, for each row, the hashtags that also appear in the hashtags list
posts_df['found_hashtags'] = posts_df['hashtags'].apply(
    lambda tags: [tag for tag in tags if tag in hashtags_set]
)

# Filter out rows where no hashtags are found
filtered_posts_df = posts_df[posts_df['found_hashtags'].apply(lambda x: len(x) > 0)]

# Save to CSV only if there are found hashtags
if not filtered_posts_df.empty:
    filtered_posts_df[['video_id', 'found_hashtags', 'video_description']].to_csv('newsbyhashtag_data.csv', index=False)

# Display the modified DataFrame
print(filtered_posts_df)


          video_id                                  video_description  \
10    7.293175e+18  things ex staff said about kpop idols pt 3 #kp...   
21    7.300415e+18  video funny & cute  #funnyvideos #meme #cute #...   
73    7.297053e+18  Number 1 is surprising 😳🙈🧐 #usa #tiktok #histo...   
270   7.311421e+18  Creepy facts about south korea 🇰🇷 #seoullife #...   
272   7.305850e+18  I was on the train at 2 am 🫠🫠 #seoullife #kore...   
...            ...                                                ...   
9297  7.305950e+18                #nyt #nytconnections #newyorktimes    
9302  7.315200e+18  Bro did NOT pronounce it correctly💀💀💀#fortnite...   
9313  7.302170e+18  Do you know why the international date line lo...   
9322  7.295880e+18  Most hated vs most loved member in kpop groups...   
9327  7.294470e+18  -KPOP IDOLS WHO WERE FORCED TO DO SOMETHING BY...   

                                               hashtags found_hashtags  
10    [kpop, kpopfyp, kpopviral, viral, fyp, kpopf