# Analysis of Description and Suggest Words

Notebook to analyze text description + suggested words come the closest to NYT headlines for that day (as calculated by running cosine similarity on the text embeddings of both groups). [You’ll manually inspect some posts to define the range of cosine similarity that makes sense the most.]

**Author: Audrey Yip**

**Table of Contents**
1. [Read in data](#1)
2. [Process description and suggested words](#2)
3. [Create .csvs and get headline and keyword data](#3)     
4. [Cosine similarity with Semantic Analysis](#4)

### 1. Read in user data <a class="anchor" id="1"></a>

This section will be fixed once we have all metadata

In [8]:
import os
import pandas as pd

In [9]:
# fix when we have metadata for all videos!
cwd = os.getcwd()
metadata_dir = f'{cwd}/../pre-processing/metadata-csv'
metadata_files = [file for file in os.listdir(metadata_dir) if file.endswith(".csv")]

dataframes = []

for file in metadata_files:
    file_path = os.path.join(metadata_dir, file)
    df = pd.read_csv(file_path)
    dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)
combined_df.head()

Unnamed: 0,video_id,video_timestamp,video_duration,video_locationcreated,suggested_words,video_diggcount,video_sharecount,video_commentcount,video_playcount,video_description,video_is_ad,video_stickers,author_username,author_name,author_followercount,author_followingcount,author_heartcount,author_videocount,author_diggcount,author_verified
0,7273221955937914155,2023-08-30T16:56:01,37.0,US,"angels in tibet, angels in tibet dance, angels...",356300.0,5606.0,986.0,2000000.0,Replying to @jade🐉not perfect yet & i made a ...,False,,thebeaulexx,beaulexx,,,,,,False
1,7273221955937914155,2023-08-30T16:56:01,37.0,US,"angels in tibet, angels in tibet dance, angels...",356300.0,5606.0,986.0,2000000.0,Replying to @jade🐉not perfect yet & i made a ...,False,,thebeaulexx,beaulexx,,,,,,False
2,7283080657893379334,2023-09-26T06:32:40,15.0,PH,"angels in tibet, Jam Republic, angels in tibet...",419100.0,3518.0,708.0,2600000.0,🧠🧠🧠,False,,clarkie_cpm,Clarkie,,,,,,False
3,7273221955937914155,2023-08-30T16:56:01,37.0,US,"angels in tibet, angels in tibet dance, angels...",356300.0,5606.0,986.0,2000000.0,Replying to @jade🐉not perfect yet & i made a ...,False,,thebeaulexx,beaulexx,,,,,,False
4,7285397643725983008,2023-10-02T12:23:48,37.0,US,"Dream Academy, angels in tibet, Adela Dream Ac...",142700.0,1373.0,551.0,1000000.0,s/o to dream academy for teaching me how to da...,False,,adelajergova,ADÉLA,,,,,,False


### 2. Process video_description and suggested_words <a class="anchor" id="2"></a>

In [10]:
# use Eni's csv for now to practice analysis

## filter out videos that are not created in us

# access relevant columns
df_filtered = combined_df[['video_id', 'video_timestamp', 'video_description', 'suggested_words']]

# only take unique videos
df_filtered_no_dup = df_filtered.drop_duplicates(subset=['video_id'])
df_filtered_no_dup.head()


Unnamed: 0,video_id,video_timestamp,video_description,suggested_words
0,7273221955937914155,2023-08-30T16:56:01,Replying to @jade🐉not perfect yet & i made a ...,"angels in tibet, angels in tibet dance, angels..."
2,7283080657893379334,2023-09-26T06:32:40,🧠🧠🧠,"angels in tibet, Jam Republic, angels in tibet..."
4,7285397643725983008,2023-10-02T12:23:48,s/o to dream academy for teaching me how to da...,"Dream Academy, angels in tibet, Adela Dream Ac..."
5,7231292396573641991,2023-05-09T17:07:50,i hate my job pt 8 i think bro its 5 am goofy ...,
6,7235928808166100242,2023-05-22T04:59:26,i got my eyes on food,


In [11]:
# create new column with just the dates
df_filtered_no_dup['video_date'] = df_filtered_no_dup['video_timestamp'].str[:10]
df_filtered_no_dup.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered_no_dup['video_date'] = df_filtered_no_dup['video_timestamp'].str[:10]


Unnamed: 0,video_id,video_timestamp,video_description,suggested_words,video_date
0,7273221955937914155,2023-08-30T16:56:01,Replying to @jade🐉not perfect yet & i made a ...,"angels in tibet, angels in tibet dance, angels...",2023-08-30
2,7283080657893379334,2023-09-26T06:32:40,🧠🧠🧠,"angels in tibet, Jam Republic, angels in tibet...",2023-09-26
4,7285397643725983008,2023-10-02T12:23:48,s/o to dream academy for teaching me how to da...,"Dream Academy, angels in tibet, Adela Dream Ac...",2023-10-02
5,7231292396573641991,2023-05-09T17:07:50,i hate my job pt 8 i think bro its 5 am goofy ...,,2023-05-09
6,7235928808166100242,2023-05-22T04:59:26,i got my eyes on food,,2023-05-22


In [15]:
import re
from nltk.corpus import stopwords

# load stop_words
stop_words = set(stopwords.words('english'))

# create list of hashtags to omit, we will not expect to find these in headlines
stop_hashtags = ['fyp', 'foryou']

def clean_description(description):
    """
    Helper function, takes video description and splits into words, removes punctuation, emojis and stop words.
    """
    if pd.isna(description):  
        return [] 
    
    # remove numbers from the text
    description = re.sub(r'\d+', '', description)

    # split the description into words
    words = description.split()
    
    # remove punctuation and emojis, make everything lowercase
    cleaned_words = [re.sub(r'[^\w\s]', '', word).lower() for word in words]
    
    # remove stop words and words containing stop hashtags
    cleaned_words = [word for word in cleaned_words if word not in stop_words and not any(stop_tag in word for stop_tag in stop_hashtags)]

    # remove empty strings
    cleaned_words = [word for word in cleaned_words if word]
    
    return cleaned_words


In [16]:
# test clean description
clean_description('love Love LOVE this little life 😍 now i gotta jump through hoops for sh i already had ☺️🤍😍🫶😚 #greenscreen #americacore🚘🏈🍔🇺🇸 #americacore🥺💗 #foryou #robbed #stolenwallet #fypシ #fypp')

['love',
 'love',
 'love',
 'little',
 'life',
 'gotta',
 'jump',
 'hoops',
 'sh',
 'already',
 'greenscreen',
 'americacore',
 'americacore',
 'robbed',
 'stolenwallet']

In [17]:
def clean_sugg_words(sugg_words):
    """
    Helper function, takes suggested words and splits into words, converts to lowercase and removes stop words.
    """
    if pd.isna(sugg_words):  
        return [] 
    
    # remove numbers from the text
    sugg_words = re.sub(r'\d+', '', sugg_words)
    
    # split the suggested words into individual words
    words = sugg_words.split(',')
    
    # split each word by white space
    words = [sub_word.strip().lower() for word in words for sub_word in word.split()]

    # convert each word to lowercase and remove  whitespace
    words = [word.strip().lower() for word in words]
    
    # remove stop words
    cleaned_words = [word for word in words if word not in stop_words]
    
    return cleaned_words

In [18]:
# test suggested words function
sugg_words = "things they dont prepare you for as a big sister, things they don't prepare you for as a big brother, things they dont prepare you for as a little sister, what they don't prepare you for as an older sister, older sister and little brother, things they don't prepare you for as a little brother, things they dont prepare you for as an older sister, things they don't prepare you as a big sister, Big Sister And Little Brother, me and my brother"
clean_sugg_words(sugg_words)

['things',
 'dont',
 'prepare',
 'big',
 'sister',
 'things',
 'prepare',
 'big',
 'brother',
 'things',
 'dont',
 'prepare',
 'little',
 'sister',
 'prepare',
 'older',
 'sister',
 'older',
 'sister',
 'little',
 'brother',
 'things',
 'prepare',
 'little',
 'brother',
 'things',
 'dont',
 'prepare',
 'older',
 'sister',
 'things',
 'prepare',
 'big',
 'sister',
 'big',
 'sister',
 'little',
 'brother',
 'brother']

In [19]:
# apply helper functions to create new column in dataframe
df_filtered_no_dup['keywords'] = df_filtered_no_dup['suggested_words'].apply(clean_sugg_words) + df_filtered_no_dup['video_description'].apply(clean_description)
df_filtered_no_dup.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_filtered_no_dup['keywords'] = df_filtered_no_dup['suggested_words'].apply(clean_sugg_words) + df_filtered_no_dup['video_description'].apply(clean_description)


Unnamed: 0,video_id,video_timestamp,video_description,suggested_words,video_date,keywords
0,7273221955937914155,2023-08-30T16:56:01,Replying to @jade🐉not perfect yet & i made a ...,"angels in tibet, angels in tibet dance, angels...",2023-08-30,"[angels, tibet, angels, tibet, dance, angels, ..."
2,7283080657893379334,2023-09-26T06:32:40,🧠🧠🧠,"angels in tibet, Jam Republic, angels in tibet...",2023-09-26,"[angels, tibet, jam, republic, angels, tibet, ..."
4,7285397643725983008,2023-10-02T12:23:48,s/o to dream academy for teaching me how to da...,"Dream Academy, angels in tibet, Adela Dream Ac...",2023-10-02,"[dream, academy, angels, tibet, adela, dream, ..."
5,7231292396573641991,2023-05-09T17:07:50,i hate my job pt 8 i think bro its 5 am goofy ...,,2023-05-09,"[hate, job, pt, think, bro, goofy, ass, audio]"
6,7235928808166100242,2023-05-22T04:59:26,i got my eyes on food,,2023-05-22,"[got, eyes, food]"


### 3. Create .csvs and get headline and keyword data <a class="anchor" id="3"></a>

In [20]:
from get_nyt_articles import filter_by_date

In [21]:
df = filter_by_date('2023-12-22')


Current working directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis
NYT directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis/../pre-processing/nyt-articles
NYT data for 2023-12-22 already in folder


In [22]:
# test analysis with this df
df.head()

Unnamed: 0.1,Unnamed: 0,abstract,lead_paragraph,headline,pub_date,document_type,section_name,type_of_material,keywords
0,0,"Bond prices are up. Nobody knows why, but it’s...",It’s been a strange few days on the Donald Tru...,A Christmas Gift From the Bond Market,2023-12-22T00:00:08+0000,article,Opinion,Op-Ed,United States Economy;Stocks and Bonds;Inflati...
1,1,"Some lawmakers are likely to oppose the move, ...",The Biden administration is preparing to relax...,U.S. Prepares to Lift Ban on Sales of Offensiv...,2023-12-22T00:24:41+0000,article,U.S.,News,United States Politics and Government;United S...
2,2,The successful petition is one of the few rece...,"The widow of Jamal Khashoggi, the Washington P...",Widow of Jamal Khashoggi Is Granted Political ...,2023-12-22T00:29:29+0000,article,U.S.,News,"Khashoggi, Jamal;Khashoggi, Hanan Elatr;Asylum..."
3,3,The officers were charged over the 2020 death ...,A jury found three Tacoma police officers not ...,3 Tacoma Police Officers Cleared in Death of a...,2023-12-22T01:04:18+0000,article,U.S.,News,"Police Brutality, Misconduct and Shootings;Ell..."
4,4,The company that operates Pornhub and other ad...,The company that operates Pornhub and other ad...,Pornhub’s Parent Company Admits to Profiting F...,2023-12-22T01:35:18+0000,article,New York,News,Compensation for Damages (Law);Pornhub;Pornogr...


In [23]:
def split_keywords(text):
    """Split text into individual keywords based on whitespace and punctuation, remove stop words"""
    if pd.isna(text):  # check if text is NaN
        return []  
    
    # split text into individual keywords based on whitespace and punctuation
    keywords = re.findall(r'\b\w+\b', text)

    # remove stop words
    cleaned_words = [word.lower() for word in keywords if word not in stop_words]
    
    return cleaned_words

def clean_headline(text):
    """Split headline into individual words based on whitespace and punctuation, remove stop words"""
    if pd.isna(text):  # check if text is NaN
        return []  
    
    # remove numbers from the text
    text = re.sub(r'\d+', '', text)

    # split text into individual keywords based on whitespace and punctuation
    keywords = text.split()

    # remove stop words
    cleaned_words = [word.lower() for word in keywords if word not in stop_words]
    
    return cleaned_words


In [24]:
# test functions
print(split_keywords('Compensation for Damages (Law);Pornhub;'))

print(clean_headline('3 Tacoma Police Officers Cleared in Death of a'))

['compensation', 'damages', 'law', 'pornhub']
['tacoma', 'police', 'officers', 'cleared', 'death']


In [25]:
test_df = df_filtered_no_dup.head()
test_df = test_df[['video_id', 'video_date', 'keywords']]
test_df 

Unnamed: 0,video_id,video_date,keywords
0,7273221955937914155,2023-08-30,"[angels, tibet, angels, tibet, dance, angels, ..."
2,7283080657893379334,2023-09-26,"[angels, tibet, jam, republic, angels, tibet, ..."
4,7285397643725983008,2023-10-02,"[dream, academy, angels, tibet, adela, dream, ..."
5,7231292396573641991,2023-05-09,"[hate, job, pt, think, bro, goofy, ass, audio]"
6,7235928808166100242,2023-05-22,"[got, eyes, food]"


In [26]:
import pandas as pd

# Create an empty list to store rows
rows = []

# Iterate over each row in the video DataFrame
for index, row in test_df.iterrows():
    # Get relevant NYT articles using the video date
    nyt_df = filter_by_date(row['video_date'])

    # get relevant keywords and combine lists
    keywords_list = nyt_df['keywords'].apply(split_keywords)
    headline_list = nyt_df['headline'].apply(clean_headline)

    combined_list = [keyword + headline for keyword, headline in zip(keywords_list, headline_list)]
    flat_list = []
    [flat_list.extend(item) for item in combined_list]
    
    # Process the data and create a new row
    new_row = {
        'video_id': row['video_id'],
        'video_date': row['video_date'],
        'video_keywords': row['keywords'],
        'nyt_keywords': flat_list
    }
    
    # Append the new row to the list of rows
    rows.append(new_row)

# Create DataFrame from the list of rows
comparison_df = pd.DataFrame(rows)

# Display the resulting DataFrame
comparison_df


Current working directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis
NYT directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis/../pre-processing/nyt-articles
NYT data for 2023-08-30 already in folder
Current working directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis
NYT directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis/../pre-processing/nyt-articles
NYT data for 2023-09-26 already in folder
Current working directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis
NYT directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis/../pre-processing/nyt-articles
NYT data for 2023-10-02 already in folder
Current working directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis
NYT directory: /Users/audreyyip/Desktop/CS 315/Project 2/Project 2 Repo/analysis/../pre-processing/nyt-articles
NYT data for 2023-05-09 already in folder
Current work

Unnamed: 0,video_id,video_date,video_keywords,nyt_keywords
0,7273221955937914155,2023-08-30,"[angels, tibet, angels, tibet, dance, angels, ...","[floyd, willie, lewis, iii, trump, donald, j, ..."
1,7283080657893379334,2023-09-26,"[angels, tibet, jam, republic, angels, tibet, ...","[presidential, election, 2024, debates, politi..."
2,7285397643725983008,2023-10-02,"[dream, academy, angels, tibet, adela, dream, ...","[national, parks, monuments, seashores, bears,..."
3,7231292396573641991,2023-05-09,"[hate, job, pt, think, bro, goofy, ass, audio]","[canada, china, politics, government, diplomat..."
4,7235928808166100242,2023-05-22,"[got, eyes, food]","[hiring, promotion, writing, writers, names, p..."


In [27]:
comparison_df['video_sentences'] = comparison_df.apply(lambda row: " ".join(row['video_keywords']), axis=1)
comparison_df['nyt_sentences'] = comparison_df.apply(lambda row: " ".join(row['nyt_keywords']), axis=1)

comparison_df.head()

Unnamed: 0,video_id,video_date,video_keywords,nyt_keywords,video_sentences,nyt_sentences
0,7273221955937914155,2023-08-30,"[angels, tibet, angels, tibet, dance, angels, ...","[floyd, willie, lewis, iii, trump, donald, j, ...",angels tibet angels tibet dance angels tibet s...,floyd willie lewis iii trump donald j willis f...
1,7283080657893379334,2023-09-26,"[angels, tibet, jam, republic, angels, tibet, ...","[presidential, election, 2024, debates, politi...",angels tibet jam republic angels tibet tutoria...,presidential election 2024 debates political r...
2,7285397643725983008,2023-10-02,"[dream, academy, angels, tibet, adela, dream, ...","[national, parks, monuments, seashores, bears,...",dream academy angels tibet adela dream academy...,national parks monuments seashores bears death...
3,7231292396573641991,2023-05-09,"[hate, job, pt, think, bro, goofy, ass, audio]","[canada, china, politics, government, diplomat...",hate job pt think bro goofy ass audio,canada china politics government diplomatic se...
4,7235928808166100242,2023-05-22,"[got, eyes, food]","[hiring, promotion, writing, writers, names, p...",got eyes food,hiring promotion writing writers names persona...


### 4. Cosine similarity with Semantic Analysis <a class="anchor" id="1"></a>

In [32]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np

2024-03-12 22:46:02.604348: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [33]:
# load the Universal Sentence Encoder's TF Hub module
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [42]:
# cosine similarity function, from week 7 notebook
from numpy.linalg import norm

def cosineSimilarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    V1 = np.array(vec1)
    V2 = np.array(vec2)
    cosine = np.dot(V1, V2)/(norm(V1)*norm(V2))
    return cosine

In [51]:
embed(["dream academy angels tibet adela dream academy"])

<tf.Tensor: shape=(1, 512), dtype=float32, numpy=
array([[ 5.02408668e-02, -5.72159961e-02,  3.88645865e-02,
        -6.19135611e-02, -3.03740706e-02,  5.70509210e-02,
        -6.21422641e-02, -2.79508382e-02, -6.55337051e-02,
        -6.32722080e-02,  2.48609297e-02,  6.93009347e-02,
         1.28965145e-02, -3.27968933e-02,  5.60682714e-02,
        -5.69079779e-02, -2.20257565e-02,  6.48708567e-02,
         3.46173085e-02, -6.87998310e-02,  6.79460689e-02,
         5.95044978e-02,  6.92275316e-02,  3.80145423e-02,
         6.86880350e-02,  1.53036006e-02, -1.03195058e-02,
        -4.88805249e-02, -6.08899724e-03, -2.44831592e-02,
         5.02196886e-02,  3.00412402e-02, -7.09355762e-03,
        -7.42765749e-03, -3.42276413e-03,  1.04430439e-02,
         2.95333024e-02,  6.53568059e-02,  6.96141645e-02,
         5.51910065e-02, -5.00645526e-02, -4.66123372e-02,
        -1.05388276e-02, -3.72130200e-02, -6.04823139e-03,
         2.21524686e-02, -3.91618907e-02,  6.89001232e-02,
      

In [56]:
# calculate cosine similarities and add them to the 

cosine_similarities = {}

for index, row in comparison_df.iterrows():
    video_id = row['video_id']
    video_sentence = row['video_sentences']
    #print(type(video_sentence))
    nyt_sentence = row['nyt_sentences']
    
    # calculate embeddings for video sentence
    video_embedding = embed([video_sentence])[0]    # not sure why??
    nyt_embedding = embed([nyt_sentence])[0]
    
    # calculate cosine similarity
    cosine_sim = cosineSimilarity(video_embedding, nyt_embedding)

    # add to dictionary
    cosine_similarities[video_id] = cosine_sim

cosine_similarities

{7273221955937914155: 0.17501032,
 7283080657893379334: 0.23219937,
 7285397643725983008: 0.22656588,
 7231292396573641991: 0.01899231,
 7235928808166100242: 0.0021175565}