<a href="https://colab.research.google.com/github/nailson-landim/1post/blob/main/ClassyIronMan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classy Ironman

Let's classify posts tagged with or mentioning Ironman, to tell whether they refer to triathlon or to something else.

## Note for the hardcore statistics folks

I'm not doing rigorous statistics here. It's more of a quick-and-dirty experiment to create a simple, cheap, and effective classifier based on Bluesky posts.


In [1]:
import re
import nltk
import spacy
import sklearn
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

try:
    nlp = spacy.load('en_core_web_sm')
    print("SpaCy 'en_core_web_sm' model loaded successfully.")
except OSError:
    print("SpaCy 'en_core_web_sm' model not found. Downloading...")
    spacy.cli.download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')
    print("SpaCy 'en_core_web_sm' model downloaded and loaded successfully.")

SpaCy 'en_core_web_sm' model loaded successfully.


## Functions

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
print("NLTK 'stopwords' and 'wordnet' data downloaded successfully.")

NLTK 'stopwords' and 'wordnet' data downloaded successfully.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [3]:
import ast

# Get English stopwords from NLTK
stop_words = set(stopwords.words('english'))

def combine_and_lowercase(row):
    """Join 'text' and the list-like columns of a row into one lowercase string."""
    combined_parts = []

    # Process 'text' column
    if pd.notna(row['text']) and row['text']:
        combined_parts.append(str(row['text']).lower())

    # Process 'alt_texts', 'tags', 'urls' columns
    for col in ['alt_texts', 'tags', 'urls']:
        if pd.notna(row[col]) and row[col]:
            try:
                # Parse the string representation of a list and join its elements.
                # ast.literal_eval only accepts Python literals, so unlike eval
                # it cannot execute code embedded in the data.
                list_items = ast.literal_eval(str(row[col]))
                if isinstance(list_items, list) and list_items:
                    combined_parts.append(' '.join(str(item).lower() for item in list_items))
            except (ValueError, SyntaxError):
                # If not a list string, just treat as a regular string
                combined_parts.append(str(row[col]).lower())

    return ' '.join(combined_parts)

def preprocess_text(text):
    # Remove newlines and extra whitespace
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()

    # Process with SpaCy for tokenization and lemmatization
    doc = nlp(text)

    # Lemmatize and remove stop words
    processed_tokens = [token.lemma_ for token in doc if token.text.lower() not in stop_words and token.is_alpha]

    return ' '.join(processed_tokens)
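
Side note: the list-like columns (`alt_texts`, `tags`, `urls`) arrive as strings such as `["en"]` in the dataset loaded further down. Here's a minimal sketch of parsing such a cell without evaluating it as code. It assumes the cells are JSON-style arrays of strings, which is only what the samples in this notebook suggest, and `parse_list_cell` is a hypothetical helper, not something the notebook uses.

```python
import json

def parse_list_cell(cell):
    """Return the items of a list-like cell as a list of strings."""
    # Missing values (NaN, None) and empty strings yield no items.
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = json.loads(cell)
    except json.JSONDecodeError:
        # Not JSON: treat the whole cell as a single plain-text item.
        return [cell]
    if isinstance(value, list):
        return [str(item) for item in value]
    # A lone JSON scalar (number, string) becomes a one-item list.
    return [str(value)]

print(parse_list_cell('["Triathlon","LA28"]'))
print(parse_list_cell('[]'))
print(parse_list_cell('not a list'))
```

Parsing with `json` never runs anything found in the data, which matters when the text comes from public posts.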

## Load our dataset from Google Drive...

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!mkdir -p /content/datasets
!cp /content/drive/MyDrive/ClassyIronMan/data.csv /content/datasets/data.csv

## Take a peek at sample rows to start feature engineering


In [6]:
source_df = pd.read_csv('/content/datasets/data.csv')
source_sample = source_df.sample(20)
source_sample

Unnamed: 0,uri,text,alt_texts,tags,urls,mentions,langs,embed_title,embed_description,embed_url,rule,is_related
43,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,Belgian Brilliance 🤩\n\nJolien Vermeylen 🇧🇪 ma...,[],"[""TongyeongWC"",""Triathlon"",""BeYourExtraordinar...",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True
1879,at://did:plc:bdg6sni7k7gq7hrgck6h3aky/app.bsky...,"Swim faster, come out of the water fresher, an...",[],[],"[""https://bit.ly/4qfxFWi""]",[],[],,,,text:(?i)\btriathlon\b,True
67,at://did:plc:jjhragb3uuqhgwoomqmw76fw/app.bsky...,🏆 T100 WORLD CHAMPS RETURN 🏆\n\nThe T100 autom...,[],[],"[""https://protriathletes.org/media-releases/pt...",[],"[""en""]",,,,url:(?i)\btriathlon\b,True
714,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,Who knew Carson had such good French 🇫🇷😆\n\n#P...,[],"[""Paratrowthlon"",""WTPSYokohama""]",[],[],"[""en""]",,,,,True
431,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,Back to defend her crown! 👑🔥\n\nThrowback to B...,[],"[""Triathlon"",""TheFutureIsNow"",""BeYourExtraordi...",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True
1135,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,FULL ARTICLE: triathlon.org/news/matt-ha...,[],[],"[""https://triathlon.org/news/matt-hauser-deton...",[],"[""en""]",Matt Hauser detonates another dose of WTCS Ham...,"<p class=""MsoNormal"">Matt Hauser went into Sat...",https://triathlon.org/news/matt-hauser-detonat...,text:(?i)\btriathlon\b,True
1060,at://did:plc:nfjffyiuhjmbni7joain3urn/app.bsky...,"Her tech, now enhanced by magical elements, co...",[],[],[],[],"[""en""]",How 'Ironheart' Could Offer the Biggest Clue Y...,'Ironheart' picks up less than a year after th...,https://blackgirlnerds.com/how-ironheart-could...,text:(?i)\biron ?man\b,
1210,at://did:plc:5ekaaudi6pc22b4axp54hqre/app.bsky...,"Triathlon. Nice, terre d’exploit pour Jérémy B...",[],[],"[""https://www.europesays.com/fr/276260/""]",[],"[""fr-FR"",""fr""]","Triathlon. Nice, terre d’exploit pour Jérémy B...","...pour lire la suite, rejoignez notre communa...",https://www.europesays.com/fr/276260/,text:(?i)\btriathlon\b,
636,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,"With World Junior silver in 2024, duathlon bro...",[],"[""LA28"",""Olympics"",""Triathlon""]",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True
525,at://did:plc:5ekaaudi6pc22b4axp54hqre/app.bsky...,Un millier de triathlètes vont envahir le Mour...,[],[],"[""https://www.europesays.com/fr/32817/""]",[],"[""fr-FR"",""fr""]",Un millier de triathlètes vont envahir le Mour...,"Pour cette nouvelle édition, le triathlon de T...",https://www.europesays.com/fr/32817/,text:(?i)\btriathlon\b,


We have some interesting stuff: some rows seem to be labeled and some aren't, according to the `is_related` column. Let's split the data between **labeled** and **unlabeled** to get stats for each, check the labeling quality, and keep exploring.

In [7]:
# Filter out rows where 'is_related' is NaN to get labeled data
labeled_df = source_df.dropna(subset=['is_related']).copy()

# Convert 'is_related' to boolean type for accurate counting, if not already
labeled_df['is_related'] = labeled_df['is_related'].astype(bool)

# Peek at 20 random rows from the labeled data
print("Random 20 rows from labeled data:")
display(labeled_df.sample(20))

# Count the number of True/False values in 'is_related'
print("\nCounts of 'is_related' values:")
display(labeled_df['is_related'].value_counts())

Random 20 rows from labeled data:


Unnamed: 0,uri,text,alt_texts,tags,urls,mentions,langs,embed_title,embed_description,embed_url,rule,is_related
1373,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,UNSTOPPABLE! 💥\n\nYour men’s podium for the 20...,[],"[""WTCSFrenchRiviera"",""Triathlon"",""TheFutureIsN...",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True
924,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,Flying the flag high! 🌍\n\nThe streets of Pont...,[],"[""MultisportWCH2025"",""WorldTriathlete"",""Pontev...",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True
2316,at://did:plc:rnpqtgtqz4vozbtlwmkak3vv/app.bsky...,One of my issues with the mcu is how the stori...,[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,False
758,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,FULL ARTICLE: triathlon.org/news/returni...,[],[],"[""https://triathlon.org/news/returning-women-f...",[],"[""en""]",Returning women face new rivals at Samarkand W...,"<p class=""MsoNormal"">After making a successful...",https://triathlon.org/news/returning-women-fac...,text:(?i)\btriathlon\b,True
2512,at://did:plc:j7cf7ocetbiy77l6jnpu4g5d/app.bsky...,"Iron Man came out in May of 2008, so we can as...",[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,False
2003,at://did:plc:bdg6sni7k7gq7hrgck6h3aky/app.bsky...,Thinking about tackling your first middle-dist...,[],[],[],[],[],,,,,True
2099,at://did:plc:aiszm5s7rajxmnj5t35tm4qh/app.bsky...,Thursday implies the existence of IronMan'sday.,[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,False
2494,at://did:plc:ut7rvmgv4xqvy7frpg4mh25d/app.bsky...,When I worked at a comic store I promise that ...,[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,False
249,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,"HANDOVER 👏\n\nFrom one athlete to another, the...",[],"[""WinterTri"",""WinterChamps""]",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True
204,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,"With a stacked elite women’s field, including ...",[],"[""NapierWC"",""Triathlon"",""TheFutureIsNow"",""BeYo...",[],[],"[""en""]",,,,text:(?i)\btriathlon\b,True



Counts of 'is_related' values:


is_related,count
True,1229
False,398


### **One note for later: there's a class imbalance (1229 True vs. 398 False, roughly 3:1) that needs to be managed.**
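
I haven't picked a way to handle the imbalance yet. One common option is to weight each class inversely to its frequency, which is the heuristic scikit-learn documents for `class_weight='balanced'`: `n_samples / (n_classes * count)`. Here's a minimal sketch using the counts printed above; `balanced_class_weights` is a hypothetical helper name, and this is just one possible approach, not a decision.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Counts taken from the value_counts() output above.
counts = {True: 1229, False: 398}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(label, round(weight, 2))
```

With these weights, both classes contribute the same total weight, so the minority `False` class isn't drowned out by the `True` one.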

In [8]:
# Filter for rows where 'is_related' is NaN to get unlabeled data
unlabeled_df = source_df[source_df['is_related'].isna()].copy()

# Peek at 20 random rows from the unlabeled data
print("Random 20 rows from unlabeled data:")
display(unlabeled_df.sample(20))

# Count the number of unlabeled entries
print("\nTotal count of unlabeled entries:")
print(len(unlabeled_df))

Random 20 rows from unlabeled data:


Unnamed: 0,uri,text,alt_texts,tags,urls,mentions,langs,embed_title,embed_description,embed_url,rule,is_related
1001,at://did:plc:nayqkg37xukdr4kdnysauoc5/app.bsky...,I remember people losing their minds over Iron...,[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,
571,at://did:plc:bxbou45dh7mhgrkiefnp6jve/app.bsky...,VesselAlert\nMMSI: 209391000\nFlag: Cyprus\nSe...,[],"[""Cyprus"",""VesselAlert"",""MMSI"",""Flag"",""under_w...","[""https://frlradar.nl/ais/?mmsi=209391000""]",[],[],,,,text:(?i)\b70\.3\b,
1743,at://did:plc:ygl5byfvhwtcuumgc6p4hixy/app.bsky...,Hey everyone I need a hundred ninety American ...,[],[],"[""https://www.ebay.ca/itm/336261620604""]",[],"[""en""]","Phil Kessel Autographed Hotdog ""Ironman"" Stanl...",Find many great new & used options and get the...,https://www.ebay.ca/itm/336261620604,embed:title:(?i)\biron ?man\b,
1804,at://did:plc:qhpxt5rroy52adfjr77bwsfj/app.bsky...,'ZOOTRÓPOLIS 2' ha superado a 'LOS MINIONS' co...,[],"[""Zootopia2"",""Minions"",""Zootrópolis2"",""LosMini...",[],[],"[""es""]",,,,text:(?i)\biron ?man\b,
2545,at://did:plc:a3kdw4qvxyy7igsstzua6h2k/app.bsky...,I DEMAND AN IRON MAN MATCH #AEWGrandSlam,[],"[""AEWGrandSlam""]",[],[],"[""en""]",,,,text:(?i)\biron ?man\b,
1172,at://did:plc:acoroaiidwdenonoztqbtlgp/app.bsky...,"Why, there's Iron Man's beard!",[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,
167,at://did:plc:wfy4l5sl22lgb2lakf6kwqvw/app.bsky...,"It's real small stakes stuff, and also an ice ...",[],[],[],[],"[""en""]",,,,text:(?i)\biron ?man\b,
2350,at://did:plc:3vem3es6gchubf5jklnflxr2/app.bsky...,🔥 Unseen Forces Feraligatr EX #103 deal (UK) 🔥...,"[""unseen forces Feraligatr EX #103""]",[],"[""https://www.ebay.co.uk/itm/389345203999?camp...",[],"[""en""]",,,,text:(?i)\b70\.3\b,
745,at://did:plc:bxbou45dh7mhgrkiefnp6jve/app.bsky...,VesselAlert\nName: ELBSPRINTER\nMMSI: 21016700...,[],"[""210167000"",""Cyprus"",""C4RJ2"",""VesselAlert"",""M...","[""https://frlradar.nl/ais/?mmsi=210167000""]",[],[],,,,text:(?i)\b70\.3\b,
554,at://did:plc:5ekaaudi6pc22b4axp54hqre/app.bsky...,Un policier municipal décède pendant le triath...,[],[],"[""https://www.europesays.com/fr/46327/""]",[],"[""fr-FR"",""fr""]",Un policier municipal décède pendant le triath...,Un concurrent de 53 ans a trouvé la mort lors ...,https://www.europesays.com/fr/46327/,text:(?i)\btriathlon\b,



Total count of unlabeled entries:
1016


So far, all columns are straightforward, but one caught my attention: **mentions**. Does it contain anything useful? Let's check it out.

In [9]:
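import ast

# Keep only rows whose 'mentions' cell parses to a non-empty list.
# ast.literal_eval parses literals only; unlike eval, it cannot run code found in the data.
def has_mentions(cell):
    return isinstance(cell, str) and len(ast.literal_eval(cell)) > 0

mentions_df = source_df[source_df['mentions'].apply(has_mentions)].copy()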

# Peek at 3 random rows from the mentions data, or fewer if less than 3 are available
num_samples = min(len(mentions_df), 3)
print(f"Random {num_samples} rows from data with mentions:")
if num_samples > 0:
    display(mentions_df.sample(num_samples))
else:
    print("No rows with mentions found.")

Random 3 rows from data with mentions:


Unnamed: 0,uri,text,alt_texts,tags,urls,mentions,langs,embed_title,embed_description,embed_url,rule,is_related
822,at://did:plc:3272gdrjsuikiff7qsgokgas/app.bsky...,"Ok, 5 weeks to the first full-distance triathl...",[],"[""ChallengeRoth""]",[],"[""did:plc:qcbkud2rb5mp3petgcof47ps""]","[""en""]",,,,did:plc:qcbkud2rb5mp3petgcof47ps,True
1203,at://did:plc:leidqgx3be72rmeiwvdzvnes/app.bsky...,A performance to remember ✨\n\nKate Waugh 🇬🇧 s...,[],[],"[""https://youtube.com/@worldtriathlon"",""https:...","[""did:plc:jjhragb3uuqhgwoomqmw76fw""]","[""en""]",,,,did:plc:jjhragb3uuqhgwoomqmw76fw,True
1099,at://did:plc:3272gdrjsuikiff7qsgokgas/app.bsky...,#ChallengeRoth is the best triathlon for sure!...,[],"[""ChallengeRoth""]",[],"[""did:plc:qcbkud2rb5mp3petgcof47ps""]","[""en""]",,,,did:plc:qcbkud2rb5mp3petgcof47ps,True


As seen, each mention is an opaque account identifier (a DID such as `did:plc:...`), not a readable handle. For the purpose of this classification I won't resolve them. (They could later lead to relevant posts, but that's for another ML project.)

**Choosing features/columns**

I want relevant data for classification. A UUID-style identifier doesn't tell me anything, but a website URL might. So I'll separate the columns I want for now from the discarded ones.

**Kept**:

- text
- alt_texts
- tags
- urls

**Discarded**:

- uri
- rule
- langs
- mentions
- embed_title, embed_description, embed_url
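
To illustrate why a website URL might carry signal while a DID doesn't, here's a minimal sketch that pulls the domain out of a URL with the standard library. `url_domain` is a hypothetical helper; the notebook itself feeds the raw `urls` strings into the combined text rather than extracting domains.

```python
from urllib.parse import urlparse

def url_domain(url):
    """Return the lowercase host of a URL without a leading 'www.', or '' if there is none."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host

# Complete values copied from the samples above.
examples = [
    "https://TriathlonLive.tv",
    "https://frlradar.nl/ais/?mmsi=209391000",
    "https://www.europesays.com/fr/276260/",
    "did:plc:qcbkud2rb5mp3petgcof47ps",
]
for value in examples:
    print(repr(url_domain(value)))
```

A domain like `triathlonlive.tv` hints at the topic, whereas the DID yields an empty string because it has no host part at all.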

In [10]:
kept_columns_labeled = ['text', 'alt_texts', 'tags', 'urls', 'is_related']
kept_columns_unlabeled = ['text', 'alt_texts', 'tags', 'urls']

# Apply to labeled_df
labeled_df_processed = labeled_df[kept_columns_labeled].copy()
print("Sample of processed labeled_df (30 rows):")
display(labeled_df_processed.sample(30))

# Apply to unlabeled_df
unlabeled_df_processed = unlabeled_df[kept_columns_unlabeled].copy()
print("\nSample of processed unlabeled_df (30 rows):")
display(unlabeled_df_processed.sample(30))

Sample of processed labeled_df (30 rows):


Unnamed: 0,text,alt_texts,tags,urls,is_related
2452,youtu.be/jGimf3hc_aA?...\n\nIf superman captai...,[],[],"[""https://youtu.be/jGimf3hc_aA?si=OlM5feasWJ_f...",False
2313,Work in progress inking Jack Kirby’s Iron Man ...,[],"[""ironman"",""comiccommission"",""jackkirby"",""marv...",[],False
2438,youtu.be/jGimf3hc_aA?...\n\nIf superman captai...,[],[],"[""https://youtu.be/jGimf3hc_aA?si=OlM5feasWJ_f...",False
1692,READ MORE: triathlon.org/news/miyazak...,[],[],"[""https://triathlon.org/news/miyazaki-welcomes...",True
998,They brought the fire! 🔥 \n\nIt was high-stake...,[],"[""MultisportWCH2025"",""Pontevedra2025"",""CrossTr...",[],True
1981,"You sign up for a race, but you don’t always k...",[],[],"[""https://bit.ly/4jFdL4v""]",True
2092,VesselAlert\nName: CATCH\nMMSI: 244129807\nCal...,[],"[""moored"",""VesselAlert"",""MMSI"",""Flag"",""2441298...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False
152,🔥 The heat is ON! 🔥\n\nWhich elite women will ...,[],"[""WTCSAbuDhabi"",""Triathlon"",""TheFutureIsNow"",""...","[""https://TriathlonLive.tv""]",True
2179,BLACK SABBATH \nParanoid \n\nOzzy Osbourne \nT...,[],[],[],False
2254,"After back-to-back swim, bike, and run session...",[],[],"[""https://bit.ly/4qE6jsm""]",True



Sample of processed unlabeled_df (30 rows):


Unnamed: 0,text,alt_texts,tags,urls
2589,30 Days of Favourite Duos \n\nSuperhero or Vil...,[],[],[]
311,He's pretty much just launching faulty and bad...,[],[],[]
405,Was this the first one that had that screen fi...,[],[],[]
2540,truke or no truke,"[""marvel rivals sexuality tier list\n\nstraigh...",[],[]
448,I may get drunk enough tonight to randomly pur...,[],[],"[""https://blacksabbathapparelshop.com/products..."
1209,"What is our operational definition of ""superhe...",[],[],[]
1897,'AVATAR: FUEGO Y CENIZA' ha superado a 'IRON M...,[],"[""AvatarFireAndAsh"",""IronMan3"",""Avatar3"",""Taqu...",[]
1056,funny to think how iron man trilogy literally ...,[],[],[]
1680,VesselAlert\nName: THEMIS\nMMSI: 538004153\nCa...,[],"[""THEMIS"",""Marshall_Islands"",""VesselAlert"",""V7...","[""https://frlradar.nl/ais/?mmsi=538004153""]"
232,Yeah I figured. I was just wondering if there'...,[],[],[]


Looking at the data, my gut feeling says that the number of *tags* matters: maybe it's an indicator of a triathlon-related post, though that can be misleading. Anyway, computing it won't do any harm.

BTW, I'll do it for all three columns that are lists of items, just to find out whether their counts are a useful feature or not. Sorting can highlight it.

In [11]:
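import ast

columns_to_count = ['alt_texts', 'tags', 'urls']

def get_list_length(item):
    """Return the number of items in a stringified list, or 0 if it can't be parsed."""
    if isinstance(item, str):
        try:
            # literal_eval parses literals only, so it is safe on untrusted text
            evaluated_item = ast.literal_eval(item)
            if isinstance(evaluated_item, list):
                return len(evaluated_item)
        except (ValueError, SyntaxError):
            pass
    return 0

# Add one '<column>_count' feature per list-like column
for col in columns_to_count:
    labeled_df_processed[f'{col}_count'] = labeled_df_processed[col].apply(get_list_length)
    unlabeled_df_processed[f'{col}_count'] = unlabeled_df_processed[col].apply(get_list_length)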

print("Sample of labeled_df_processed with new count columns (10 rows):")
display(labeled_df_processed.sample(10))

print("\nSample of unlabeled_df_processed with new count columns (10 rows):")
display(unlabeled_df_processed.sample(10))



Sample of labeled_df_processed with new count columns (10 rows):


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count
2399,Iron Man by Black Sabbath: the story behind th...,[],[],"[""https://www.europesays.com/uk/764054/""]",False,0,0,1
2247,Iron Man Records - Immanentize The Eschaton (H...,[],"[""music"",""musicians"",""hoodie""]","[""https://buff.ly/PfSPZJ9""]",False,0,3,1
949,FULL ARTICLE: triathlon.org/news/benjami...,[],[],"[""https://triathlon.org/news/benjamin-choquert...",True,0,0,1
350,Cathia Schar dominated on two wheels at the 20...,[],"[""LievinWC"",""Triathlon"",""TheFutureIsNow"",""BeYo...",[],True,0,4,0
313,Tato! Tem muita gente no triathlon que curte.\...,[],[],[],True,0,0,0
871,📺 Catch all the action LIVE on TriathlonLive.t...,[],[],"[""https://TriathlonLive.tv""]",True,0,0,1
2183,From Elsa Bloodstone's debut to the final Dare...,[],[],[],False,0,0,0
2091,dominador > dominatriz,[],[],[],True,0,0,0
296,"Fresh off a tough training camp, Zuzana Michal...",[],"[""Triathlon"",""TheFutureIsNow"",""BeYourExtraordi...",[],True,0,3,0
262,World Cup racing at its finest! 👌🔥\n\nNapier’s...,[],"[""WC"",""NapierWC"",""Triathlon"",""TheFutureIsNow"",...","[""https://TriathlonLive.tv""]",True,0,5,1



Sample of unlabeled_df_processed with new count columns (10 rows):


Unnamed: 0,text,alt_texts,tags,urls,alt_texts_count,tags_count,urls_count
462,Men's triathlon at 2024 Summer Olympics in Par...,[],"[""Olimpiada2024""]","[""https://fefd.link/leX1Z""]",0,1,1
1779,Looking forward to all of the crimes Iron Man ...,[],[],[],0,0,0
2573,"in your main event, it's the 1 hour anything g...",[],[],[],0,0,0
532,VesselAlert\nMMSI: 209391000\nFlag: Cyprus\nSe...,[],"[""Cyprus"",""VesselAlert"",""MMSI"",""Flag"",""under_w...","[""https://frlradar.nl/ais/?mmsi=209391000""]",0,10,1
452,"That meme with the wolf saying “Look at me, I’...",[],[],[],0,0,0
649,The problem with superhero movies is that in m...,[],[],[],0,0,0
985,seriously \n\nif you're never seen the iron ma...,[],"[""AEWDynamite""]",[],0,1,0
315,We are watching the White Lotus S3 and they ha...,[],[],[],0,0,0
1577,"On this day 1987, Doug Jarvis' ironman streak ...","[""A photograph of a man wearing a green hockey...",[],[],1,0,0
213,17. Iron Man 2,[],[],[],0,0,0


In [12]:
# Calculate the sum of counts for each DataFrame
labeled_df_processed['total_counts'] = labeled_df_processed['alt_texts_count'] + labeled_df_processed['tags_count'] + labeled_df_processed['urls_count']
unlabeled_df_processed['total_counts'] = unlabeled_df_processed['alt_texts_count'] + unlabeled_df_processed['tags_count'] + unlabeled_df_processed['urls_count']

# Sort labeled_df_processed by total_counts
sorted_labeled_most_tagged = labeled_df_processed.sort_values(by='total_counts', ascending=False)
sorted_labeled_least_tagged = labeled_df_processed.sort_values(by='total_counts', ascending=True)

print("\n--- Labeled Data: Most Tagged (10 rows) ---")
display(sorted_labeled_most_tagged.head(10))

print("\n--- Labeled Data: Least Tagged (10 rows) ---")
display(sorted_labeled_least_tagged.head(10))

# Sort unlabeled_df_processed by total_counts
sorted_unlabeled_most_tagged = unlabeled_df_processed.sort_values(by='total_counts', ascending=False)
sorted_unlabeled_least_tagged = unlabeled_df_processed.sort_values(by='total_counts', ascending=True)

print("\n--- Unlabeled Data: Most Tagged (10 rows) ---")
display(sorted_unlabeled_most_tagged.head(10))

print("\n--- Unlabeled Data: Least Tagged (10 rows) ---")
display(sorted_unlabeled_least_tagged.head(10))


--- Labeled Data: Most Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count,total_counts
2243,Roadmap season 6.5 #marvelrivals #pc #xbox #pl...,[],"[""marvelrivals"",""pc"",""xbox"",""playstation"",""mar...",[],False,0,22,0,22
1978,VesselAlert NEW FIRST Observation\nName: CATCH...,[],"[""moored"",""VesselAlert"",""FIRST"",""MMSI"",""Flag"",...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False,0,14,1,15
2079,VesselAlert\nName: CARPE DIEM\nMMSI: 226018820...,[],"[""France"",""CARPE_DIEM"",""VesselAlert"",""MMSI"",""F...","[""https://frlradar.nl/ais/?mmsi=226018820""]",False,0,13,1,14
2044,VesselAlert\nName: BALTIC SHARK\nMMSI: 3049320...,[],"[""304932000"",""VesselAlert"",""Antigua_Barbuda"",""...","[""https://frlradar.nl/ais/?mmsi=304932000""]",False,0,13,1,14
1995,VesselAlert\nName: CATCH\nMMSI: 244129807\nCal...,[],"[""moored"",""VesselAlert"",""MMSI"",""Flag"",""2441298...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False,0,13,1,14
1947,VesselAlert\nName: ADRIANA\nMMSI: 244710215\nC...,[],"[""244710215"",""PD3869"",""VesselAlert"",""MMSI"",""Fl...","[""https://frlradar.nl/ais/?mmsi=244710215""]",False,0,13,1,14
1967,VesselAlert\nName: LAGUNE\nMMSI: 211169360\nCa...,[],"[""LAGUNE"",""VesselAlert"",""Germany"",""MMSI"",""DA50...","[""https://frlradar.nl/ais/?mmsi=211169360""]",False,0,13,1,14
2009,VesselAlert\nName: EMS HIGHWAY\nMMSI: 21288200...,[],"[""Cyprus"",""VesselAlert"",""MMSI"",""EMS_HIGHWAY"",""...","[""https://frlradar.nl/ais/?mmsi=212882000""]",False,0,13,1,14
1989,VesselAlert\nName: CATCH\nMMSI: 244129807\nCal...,[],"[""moored"",""VesselAlert"",""MMSI"",""Flag"",""2441298...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False,0,13,1,14
2045,VesselAlert\nName: HOLLANDS DIEP\nMMSI: 244630...,[],"[""Tanker"",""HOLLANDS_DIEP"",""VesselAlert"",""PE456...","[""https://frlradar.nl/ais/?mmsi=244630100""]",False,0,13,1,14



--- Labeled Data: Least Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count,total_counts
2613,Holy cow! Just reading this reminded me of the...,[],[],[],False,0,0,0,0
2608,Feeling like Tony Stark first thing in the mor...,[],[],[],False,0,0,0,0
2523,For the triathlon...I'll get my coat....,[],[],[],True,0,0,0,0
2522,Well... 🙈,[],[],[],False,0,0,0,0
2521,And retconned in a whole bunch of deaths while...,[],[],[],False,0,0,0,0
2520,Bought this off the rack and regretted it imme...,[],[],[],False,0,0,0,0
26,"The Brit leaves an unparalleled legacy, becomi...",[],[],[],True,0,0,0,0
21,The first ever T100 World Champions 🏆🏆\n\nTayl...,[],[],[],True,0,0,0,0
30,A hero retires 🥹\n\nThanks for the incredible ...,[],[],[],True,0,0,0,0
29,"Thank you, Alistair, for a career that truly w...",[],[],[],True,0,0,0,0



--- Unlabeled Data: Most Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,alt_texts_count,tags_count,urls_count,total_counts
1887,VesselAlert NEW FIRST Observation\nName: NORTH...,[],"[""Tanker"",""PH2471"",""VesselAlert"",""FIRST"",""MMSI...","[""https://frlradar.nl/ais/?mmsi=244094246""]",0,15,1,16
479,VesselAlert NEW FIRST Observation\nName: HJORD...,[],"[""Finland"",""VesselAlert"",""FIRST"",""MMSI"",""Flag""...","[""https://frlradar.nl/ais/?mmsi=230351000""]",0,15,1,16
1837,VesselAlert NEW FIRST Observation\nName: AMARA...,[],"[""Tanker"",""AMARANTH"",""Cyprus"",""VesselAlert"",""F...","[""https://frlradar.nl/ais/?mmsi=209507000""]",0,15,1,16
631,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15
630,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15
594,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15
626,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15
627,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15
628,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15
619,TropoAlert - Max Dist. = 70.8 nm\nVesselAlert\...,[],"[""moored"",""VesselAlert"",""WIND_ENERGY"",""MMSI"",""...","[""https://frlradar.nl/ais/?mmsi=219032199""]",0,14,1,15



--- Unlabeled Data: Least Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,alt_texts_count,tags_count,urls_count,total_counts
2627,It’s not even an unusual word though! Triathlo...,[],[],[],0,0,0,0
277,you might be able to take out a certain billio...,[],[],[],0,0,0,0
276,"Nobody knows I'm playing life on Ironman, but ...",[],[],[],0,0,0,0
272,"that's a machine right there! half snail, half...",[],[],[],0,0,0,0
260,"Yeah, the Iron Man's are up there.",[],[],[],0,0,0,0
247,It is OKAY for a 40 year old defenseman to NOT...,[],[],[],0,0,0,0
246,Things I like about the cover to IRON MAN #46:...,[],[],[],0,0,0,0
245,That cover to Iron Man #45 is a real bait-and-...,[],[],[],0,0,0,0
241,I lived/worked there for a few months. \n\nDid...,[],[],[],0,0,0,0
177,elon musk is basically suicide bombing the ent...,[],[],[],0,0,0,0


It doesn't seem that the total count by itself has to do with triathlon: the most-tagged posts are automated alerts (VesselAlert, TropoAlert) with lots of tags. It CAN still be a feature, so let's look closer, splitting the labeled data into True and False, each ordered by most and least tag counts.

In [13]:
# Filter labeled data for 'is_related == True'
labeled_true = labeled_df_processed[labeled_df_processed['is_related'] == True].copy()

# Filter labeled data for 'is_related == False'
labeled_false = labeled_df_processed[labeled_df_processed['is_related'] == False].copy()

print("--- Labeled Data (is_related=True): Most Tagged (10 rows) ---")
display(labeled_true.sort_values(by='total_counts', ascending=False).head(10))

print("\n--- Labeled Data (is_related=True): Least Tagged (10 rows) ---")
display(labeled_true.sort_values(by='total_counts', ascending=True).head(10))

print("\n--- Labeled Data (is_related=False): Most Tagged (10 rows) ---")
display(labeled_false.sort_values(by='total_counts', ascending=False).head(10))

print("\n--- Labeled Data (is_related=False): Least Tagged (10 rows) ---")
display(labeled_false.sort_values(by='total_counts', ascending=True).head(10))

--- Labeled Data (is_related=True): Most Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count,total_counts
206,The FIRST women’s #WC line up of 2025! 💥\n\nCh...,[],"[""WC"",""NapierWC"",""NapierWC"",""Triathlon"",""TheFu...","[""https://TriathlonLive.tv""]",True,0,6,1,7
487,Sport at its best is sport played true 🤝✨\n\nF...,[],"[""OnePlayTrueTeam"",""PlayTrueDay"",""CleanSport"",...","[""http://wada-ama.org/en""]",True,0,6,1,7
1232,Big drop pending ⏳\n\nDare to Dream: The Next ...,[],"[""Triathlon"",""DareToDream"",""TheNextMove"",""TheF...","[""https://TriathlonLive.tv"",""https://youtube.c...",True,0,4,2,6
1371,IT’S RACE DAY… 🏆\n\n#WTCSFrenchRiviera has arr...,[],"[""WTCSFrenchRiviera"",""Triathlon"",""TheFutureIsN...","[""https://TriathlonLive.tv"",""https://shorturl....",True,0,4,2,6
964,It’s #OlympicDay! 🎉\n\nMovement is more than s...,[],"[""OlympicDay"",""LetsMove"",""Triathlon"",""Olympics...","[""https://shorturl.at/mrNFD""]",True,0,5,1,6
965,It’s #OlympicDay! 🎉\n\nMovement is more than s...,[],"[""OlympicDay"",""LetsMove"",""Triathlon"",""Olympics...","[""https://shorturl.at/mrNFD""]",True,0,5,1,6
52,"Inspiring the World ✨🌍\n\nThis summer, the #Pa...",[],"[""Paris2024"",""ParalympicTRI"",""Paralympics"",""Pa...",[],True,0,6,0,6
1358,Competing with the very best 💪\n\nEpisode 4️⃣ ...,[],"[""DareToDream"",""TheNextMove"",""HotShot"",""Triath...","[""https://TriathlonLive.tv"",""https://shorturl....",True,0,4,2,6
209,Coming in hot! 🔥\n\nThis men’s 2025 #NapierWC ...,[],"[""NapierWC"",""NapierWC"",""Triathlon"",""TheFutureI...","[""https://TriathlonLive.tv""]",True,0,5,1,6
262,World Cup racing at its finest! 👌🔥\n\nNapier’s...,[],"[""WC"",""NapierWC"",""Triathlon"",""TheFutureIsNow"",...","[""https://TriathlonLive.tv""]",True,0,5,1,6



--- Labeled Data (is_related=True): Least Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count,total_counts
0,I am new here and don’t know anyone yet. Let’s...,[],[],[],True,0,0,0,0
2226,A new triathlon. Brilliant,[],[],[],True,0,0,0,0
2235,Como los que aprovechan la baja de paternidad ...,[],[],[],True,0,0,0,0
2237,"Como diría @elmundotoday.com , aprovechó el vo...",[],[],[],True,0,0,0,0
2245,We need to redo English because there is no wa...,[],[],[],True,0,0,0,0
2248,There should be a triathlon where the do the s...,[],[],[],True,0,0,0,0
2250,"Wollte an einem Triathlon teilnehmen, bin aber...",[],[],[],True,0,0,0,0
2266,Every type of bike racing I’ve done has a high...,[],[],[],True,0,0,0,0
2277,RIP HannahHenry a Canadian triathlete tragical...,[],[],[],True,0,0,0,0
2279,Maybe Pam Bondi is a triathlete,[],[],[],True,0,0,0,0



--- Labeled Data (is_related=False): Most Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count,total_counts
2243,Roadmap season 6.5 #marvelrivals #pc #xbox #pl...,[],"[""marvelrivals"",""pc"",""xbox"",""playstation"",""mar...",[],False,0,22,0,22
1978,VesselAlert NEW FIRST Observation\nName: CATCH...,[],"[""moored"",""VesselAlert"",""FIRST"",""MMSI"",""Flag"",...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False,0,14,1,15
1995,VesselAlert\nName: CATCH\nMMSI: 244129807\nCal...,[],"[""moored"",""VesselAlert"",""MMSI"",""Flag"",""2441298...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False,0,13,1,14
2009,VesselAlert\nName: EMS HIGHWAY\nMMSI: 21288200...,[],"[""Cyprus"",""VesselAlert"",""MMSI"",""EMS_HIGHWAY"",""...","[""https://frlradar.nl/ais/?mmsi=212882000""]",False,0,13,1,14
2045,VesselAlert\nName: HOLLANDS DIEP\nMMSI: 244630...,[],"[""Tanker"",""HOLLANDS_DIEP"",""VesselAlert"",""PE456...","[""https://frlradar.nl/ais/?mmsi=244630100""]",False,0,13,1,14
2044,VesselAlert\nName: BALTIC SHARK\nMMSI: 3049320...,[],"[""304932000"",""VesselAlert"",""Antigua_Barbuda"",""...","[""https://frlradar.nl/ais/?mmsi=304932000""]",False,0,13,1,14
1989,VesselAlert\nName: CATCH\nMMSI: 244129807\nCal...,[],"[""moored"",""VesselAlert"",""MMSI"",""Flag"",""2441298...","[""https://frlradar.nl/ais/?mmsi=244129807""]",False,0,13,1,14
1967,VesselAlert\nName: LAGUNE\nMMSI: 211169360\nCa...,[],"[""LAGUNE"",""VesselAlert"",""Germany"",""MMSI"",""DA50...","[""https://frlradar.nl/ais/?mmsi=211169360""]",False,0,13,1,14
2079,VesselAlert\nName: CARPE DIEM\nMMSI: 226018820...,[],"[""France"",""CARPE_DIEM"",""VesselAlert"",""MMSI"",""F...","[""https://frlradar.nl/ais/?mmsi=226018820""]",False,0,13,1,14
1947,VesselAlert\nName: ADRIANA\nMMSI: 244710215\nC...,[],"[""244710215"",""PD3869"",""VesselAlert"",""MMSI"",""Fl...","[""https://frlradar.nl/ais/?mmsi=244710215""]",False,0,13,1,14



--- Labeled Data (is_related=False): Least Tagged (10 rows) ---


Unnamed: 0,text,alt_texts,tags,urls,is_related,alt_texts_count,tags_count,urls_count,total_counts
2621,like it's not a crime to be annoying. i'm the ...,[],[],[],False,0,0,0,0
1929,Do you think Iron Man’s favorite pizza is Pepp...,[],[],[],False,0,0,0,0
2616,,[],[],[],False,0,0,0,0
2613,Holy cow! Just reading this reminded me of the...,[],[],[],False,0,0,0,0
1934,consistently trying to tell you how important ...,[],[],[],False,0,0,0,0
1935,You just wanted to be able to recreate House ...,[],[],[],False,0,0,0,0
1962,Just me casually punishing an Iron Man for thi...,[],[],[],False,0,0,0,0
1969,I sincerely think that if Iron Man had been re...,[],[],[],False,0,0,0,0
1975,Sounds like you fucking iron man til the docto...,[],[],[],False,0,0,0,0
2487,Robert Downey Jr. shares new photos for Valent...,[],[],[],False,0,0,0,0


There seems to be a pattern: among the posts that have tags at all, the True ones carry about 5-6 tags, while the False ones carry about 13-15. That's not a criterion per se, but a relationship SEEMS to exist.
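
To look at that pattern beyond eyeballing the top rows, something like the sketch below could help. It is only a rough check, not a statistical test. The `toy_df` values are made up for illustration (they are not taken from the dataset), and `tag_summary` is a hypothetical name; only the `is_related` and `tags_count` columns come from the notebook.

```python
import pandas as pd

# Toy stand-in for the labeled posts; the numbers are illustrative only.
toy_df = pd.DataFrame({
    "is_related": [True, True, True, False, False, False],
    "tags_count": [5, 6, 0, 13, 14, 0],
})

# Keep only posts that actually have tags, since untagged posts say nothing here.
tagged = toy_df[toy_df["tags_count"] > 0]

# Use readable labels as the grouping key instead of raw booleans.
label = tagged["is_related"].map({True: "related", False: "unrelated"})

# Count, median, min and max of tags per label.
tag_summary = tagged.groupby(label)["tags_count"].agg(["count", "median", "min", "max"])
print(tag_summary)
```

Swapping `toy_df` for the real labeled DataFrame should give the same kind of summary, assuming it still has both columns at that point.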

With that done, let's take one more step: join text, alt_texts, tags and urls together, as if they were a single post.
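
For reference, here is a minimal sketch of what such a join can look like on a single row. It is not the notebook's `combine_and_lowercase`; `combine_fields` is a hypothetical variant that assumes the list columns are stored as strings like `'["a","b"]'` and parses them with `ast.literal_eval` instead of `eval`. The example row is made up.

```python
import ast

def combine_fields(row, list_cols=("alt_texts", "tags", "urls")):
    """Join 'text' plus the list-like columns into one lowercase string."""
    parts = []

    text = row.get("text")
    if isinstance(text, str) and text:
        parts.append(text.lower())

    for col in list_cols:
        raw = row.get(col)
        # Skip missing values (None, NaN) and empty strings.
        if not isinstance(raw, str) or not raw:
            continue
        try:
            # literal_eval only accepts Python literals, so it cannot run code.
            items = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Not a parsable literal: keep the raw string as a single item.
            items = [raw]
        if isinstance(items, (list, tuple)):
            parts.extend(str(item).lower() for item in items if item)
        else:
            parts.append(str(items).lower())

    return " ".join(parts)

# Made-up row, shaped like the columns shown above.
example_row = {
    "text": "A new triathlon. Brilliant",
    "alt_texts": "[]",
    "tags": '["Triathlon","swim"]',
    "urls": "[]",
}
combined = combine_fields(example_row)
print(combined)
```

Because it only relies on `row.get(...)`, the same function should also work with `DataFrame.apply(..., axis=1)`, where each row is a pandas Series.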


In [14]:
labeled_posts_df = labeled_df_processed.copy()
unlabeled_posts_df = unlabeled_df_processed.copy()

In [15]:
# Apply the function to labeled_df_processed
labeled_posts_df['post'] = labeled_df_processed.apply(combine_and_lowercase, axis=1)

# Apply the function to unlabeled_df_processed
unlabeled_posts_df['post'] = unlabeled_df_processed.apply(combine_and_lowercase, axis=1)

labeled_posts_df.drop(columns=['text', 'alt_texts', 'tags', 'urls'], inplace=True)
unlabeled_posts_df.drop(columns=['text', 'alt_texts', 'tags', 'urls'], inplace=True)

print("Sample of labeled_posts_df with new 'post' column (10 rows):")
display(labeled_posts_df.sample(10))

print("\nSample of unlabeled_posts_df with new 'post' column (10 rows):")
display(unlabeled_posts_df.sample(10))

Sample of labeled_posts_df with new 'post' column (10 rows):


Unnamed: 0,is_related,alt_texts_count,tags_count,urls_count,total_counts,post
519,True,0,4,0,4,lisa tertsch leading the line! 💪🇩🇪\n\nas we ge...
1325,True,0,0,0,0,
1326,True,0,4,0,4,#wtcskarlovyvary 🇨🇿 could be the toughest test...
911,True,0,4,0,4,not long until we’re on that start line! ⏳🇪🇸\n...
2397,False,0,0,0,0,the end of pre wandavision mcu fandom started ...
1336,True,0,3,0,3,the boys are here 🔥\n\ncan alex yee or hayden ...
1380,True,0,4,1,5,men’s 2025 #wtcsfrenchriviera results! 👉\n\nmi...
1811,True,0,0,1,1,from better breathing in the swim to absorbing...
1381,True,0,0,0,0,
268,True,0,4,1,5,"napier called, and these men answered with spe..."



Sample of unlabeled_posts_df with new 'post' column (10 rows):


Unnamed: 0,alt_texts_count,tags_count,urls_count,total_counts,post
973,0,0,0,0,tetsuo: the iron man (1989)
1283,0,13,1,14,vesselalert\nname: hoegh sunrise\nmmsi: 258192...
1835,0,13,1,14,vesselalert\nname: mirva vg\nmmsi: 230662000\n...
260,0,0,0,0,"yeah, the iron man's are up there."
849,0,0,1,1,"nage en eau libre : qui est marc stettler, « l..."
1790,0,13,1,14,vesselalert\nname: hafnia leopard\nmmsi: 56386...
1131,0,0,1,1,yes! also great in this classic movie!\n\niron...
1897,0,5,0,5,'avatar: fuego y ceniza' ha superado a 'iron m...
1176,0,0,1,1,youtu.be/qrcyjjq0jhg?... https://youtu.be/qrcy...
1157,0,0,0,0,trump is epstein’s iron man\nhas he lost his m...


In [16]:
labeled_posts_df['processed_post'] = labeled_posts_df['post'].apply(preprocess_text)
unlabeled_posts_df['processed_post'] = unlabeled_posts_df['post'].apply(preprocess_text)

print("Preprocessing applied to 'post' column in both DataFrames.")

Preprocessing applied to 'post' column in both DataFrames.


In [17]:
print("Sample of labeled_posts_df with original 'post' and 'processed_post' columns:")
display(labeled_posts_df.sample(5))

print("\nSample of unlabeled_posts_df with original 'post' and 'processed_post' columns:")
display(unlabeled_posts_df.sample(5))

Sample of labeled_posts_df with original 'post' and 'processed_post' columns:


Unnamed: 0,is_related,alt_texts_count,tags_count,urls_count,total_counts,post,processed_post
126,True,0,3,0,3,the two-week countdown is about to begin!\n\ni...,two week countdown begin almost go time future...
25,True,0,3,0,3,"the end of an era 🐐🇬🇧\n\ntoday, we celebrate o...",end era today celebrate one great triathlete t...
1957,True,0,0,1,1,let’s be honest - how often do we skip stretch...,let honest often skip stretching regret later ...
1221,True,0,0,0,0,follow the leader!\n\nat exactly half way thro...,follow leader exactly half way world triathlon...
2022,True,0,0,1,1,bike pedals seem simple until they’re not. fro...,bike pedal seem simple flat clipless float hea...



Sample of unlabeled_posts_df with original 'post' and 'processed_post' columns:


Unnamed: 0,alt_texts_count,tags_count,urls_count,total_counts,post,processed_post
1744,0,0,0,0,iron man by gerald parel,iron man gerald parel
1628,0,1,1,2,german triathlete anne reischmann participates...,german triathlete anne reischmann participate ...
1792,0,14,1,15,tropoalert - max dist. = 70.3 nm\nvesselalert\...,tropoalert max dist nm vesselalert name hafnia...
330,0,0,0,0,i really think people have memory-holed how mu...,really think people memory hole much soft powe...
782,0,0,0,0,i may have doomed myself. i just gave my d&d p...,may doom give player veritable cornucopia diff...


## WIP

The current lemmatization step got rid of emojis and URLs. I want to keep them in some other form, as they seem useful to me.
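
One possible direction, sketched under assumptions: pull URLs and emojis out of the raw `post` before it goes through preprocessing, and keep them as separate lists that can become features later. `split_side_features`, `URL_RE` and `EMOJI_RE` are hypothetical names, and the emoji ranges below are a rough approximation, not a complete emoji definition.

```python
import re

# Anything that starts with http:// or https:// up to the next whitespace.
URL_RE = re.compile(r"https?://\S+")

# Rough emoji coverage: pictographs, regional-indicator (flag) letters,
# misc symbols/dingbats, and misc technical symbols such as the hourglass.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\u2300-\u23FF]"
)

def split_side_features(post):
    """Return (text_only, urls, emojis) for one raw post string."""
    urls = URL_RE.findall(post)
    emojis = EMOJI_RE.findall(post)
    # Remove URLs first, then emojis, then collapse the leftover whitespace.
    text_only = EMOJI_RE.sub(" ", URL_RE.sub(" ", post))
    text_only = " ".join(text_only.split())
    return text_only, urls, emojis

# Made-up example post.
text_only, urls, emojis = split_side_features(
    "go time \U0001F525 https://example.com/race"
)
print(text_only, urls, emojis)
```

Note that a flag emoji is two regional-indicator characters, so it comes back as two list items with this pattern. The `text_only` part could then go through the existing preprocessing, while `urls` and `emojis` are kept aside.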
