# Hashtag Recommender System

## Objective:
This notebook walks through the complete pipeline for building a content-based recommender system using past Instagram data. The goal is to suggest relevant, high-performing hashtags for a given caption by analyzing text similarity and engagement metrics.

## 1. Load and Clean Data
- Parse date, remove nulls, extract hashtags
- Create engagement score

In [1]:
import pandas as pd
import re

df = pd.read_csv("../data/instagram-data.csv")

# Clean data
df['Date'] = pd.to_datetime(df['Date'])
df = df.dropna(subset = ['Caption']) # drop rows w/o captions
df.fillna(0, inplace = True) # fill NaNs with 0
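df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True) # keep row labels aligned with the positional indices of the similarity matrix used later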

# Create engagement score as a weighted sum: Likes + 2*Shares + 2*Comments + 1.5*Saves + 3*Follows
df['EngagementScore'] = (df['Likes'] + 2*df['Shares'] + 2*df['Comments'] + 1.5*df['Saves'] + 3*df['Follows'])

# Extract hashtags
def extract_hashtags(text):
    return re.findall(r'#(\w+)', str(text).lower()) # str() first, so non-string values (e.g. the 0 fill) don't break .lower()
df['HashtagList'] = df['Hashtags'].apply(extract_hashtags) # Row by row application of hashtag extraction

# Display first few rows
print(df[['Date', 'Caption', 'HashtagList', 'EngagementScore']].head())

        Date                                            Caption  \
0 2021-12-10  Here are some of the most important data visua...   
1 2021-12-11  Here are some of the best data science project...   
2 2021-12-12  Learn how to train a machine learning model an...   
3 2021-12-13  Here’s how you can write a Python program to d...   
4 2021-12-14  Plotting annotations while visualizing your da...   

                                         HashtagList  EngagementScore  
0  [finance, money, business, investing, investme...            343.0  
1  [healthcare, health, covid, data, datascience,...            587.0  
2  [data, datascience, dataanalysis, dataanalytic...            252.5  
3  [python, pythonprogramming, pythonprojects, py...            529.0  
4  [datavisualization, datascience, data, dataana...            285.0  
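
The two derived columns can be checked on a couple of hand-made rows. The sketch below is only an illustration: the `toy` frame and its values are made up, and the second row uses `0` in place of a hashtag string to mimic what a missing value looks like after `fillna(0)`.

```python
import re
import pandas as pd

def extract_hashtags(text):
    """Return lowercase hashtag names (without '#') found in text."""
    return re.findall(r'#(\w+)', str(text).lower())

# Hypothetical toy rows; the column names mirror the ones used above.
toy = pd.DataFrame({
    'Hashtags': ['#Python #DataScience', 0],
    'Likes': [100, 50], 'Shares': [5, 0], 'Comments': [10, 2],
    'Saves': [20, 4], 'Follows': [2, 0],
})

toy['HashtagList'] = toy['Hashtags'].apply(extract_hashtags)
toy['EngagementScore'] = (toy['Likes'] + 2*toy['Shares'] + 2*toy['Comments']
                          + 1.5*toy['Saves'] + 3*toy['Follows'])

print(toy[['HashtagList', 'EngagementScore']])
```

The first row scores 100 + 10 + 20 + 30 + 6 = 166.0, and the row without hashtags yields an empty list rather than an error.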


## 2. TF-IDF Vectorization
- Vectorize captions
- Compute cosine similarity

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create TF-IDF matrix from captions
tfidf = TfidfVectorizer(stop_words = 'english') # Init
tfidf_matrix = tfidf.fit_transform(df['Caption'])

# Compute cosine similarity between all posts
cos_sim_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
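
To get a feel for what the similarity matrix contains, here is a minimal sketch on three made-up captions (the `toy_*` names and the captions are assumptions for illustration, not part of the dataset). Captions that share informative words should score higher than captions with no words in common.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical captions: two about data science, one unrelated
toy_captions = [
    "learn python for data science",
    "python projects for data science beginners",
    "healthy breakfast recipes",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_captions)

# Square matrix: entry [i, j] is the similarity between caption i and caption j
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

The diagonal is 1 because every caption is identical to itself, which is why the recommender has to exclude the query post from its own list of neighbours.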

## 3. Similarity Recommender and Output
- Loops through multiple Instagram posts to compute similarity-based hashtag recommendations using caption text and cosine similarity
- Filters similar posts by above-median engagement and exports the top ranked hashtags for the first 10 query posts to a CSV
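
The filter-then-collect idea in the bullets above can be sketched as a small function on toy data. This is a simplified variation, not the notebook's exact loop: `recommend_hashtags`, `toy_posts`, and `toy_sim` are hypothetical names, and the sketch removes duplicates while collecting, so the early stop counts unique tags only.

```python
import numpy as np
import pandas as pd

def recommend_hashtags(query_idx, sim_matrix, posts, top_n=5):
    """Collect unique hashtags from the most similar above-median posts."""
    min_engagement = posts['EngagementScore'].median()
    # Post indices from most to least similar, excluding the query itself
    order = [i for i in np.argsort(-sim_matrix[query_idx], kind='stable')
             if i != query_idx]
    tags = []
    for i in order:
        if posts.loc[i, 'EngagementScore'] < min_engagement:
            continue  # skip posts below the engagement threshold
        for tag in posts.loc[i, 'HashtagList']:
            if tag not in tags:
                tags.append(tag)
        if len(tags) >= top_n:
            break
    return tags[:top_n]

# Hypothetical posts and a hand-written similarity matrix
toy_posts = pd.DataFrame({
    'HashtagList': [['python'], ['python', 'data'], ['food'], ['data', 'ml']],
    'EngagementScore': [100.0, 300.0, 50.0, 200.0],
})
toy_sim = np.array([
    [1.0, 0.8, 0.0, 0.5],
    [0.8, 1.0, 0.1, 0.6],
    [0.0, 0.1, 1.0, 0.0],
    [0.5, 0.6, 0.0, 1.0],
])

result = recommend_hashtags(0, toy_sim, toy_posts, top_n=3)
print(result)
```

With a median engagement of 150, only posts 1 and 3 pass the filter, so the query post 0 receives tags from those two in order of similarity.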

In [None]:
top_n = 5  # Number of hashtags to recommend
min_engagement = df['EngagementScore'].median()  # Filter threshold
all_outputs = [] 

# Loop through the first 10 posts. This is the sample used for the Tableau output.
for idx in range(10):
    caption_text = df.loc[idx, 'Caption']
    print(f"\nQuery caption [{idx}]:", caption_text)

    # Step 1: Get similar posts
    # Each entry in cos_sim_matrix[idx] is the similarity between the query post and another post
    similarities = list(enumerate(cos_sim_matrix[idx]))  # (post_index, similarity_score)

    # Sort by similarity score in descending order
    similarities = sorted(similarities, key=lambda x: x[1], reverse=True)

    # Remove the post itself from the list
    similar_posts = []
    for post_index, score in similarities:
        if post_index != idx:
            similar_posts.append((post_index, score))

    # Step 2: Gather recommended hashtags
    recommended_tags = []

    # Loop through similar posts to collect hashtags
    for post_index, score in similar_posts:
        # Only consider posts with good engagement
        if df.loc[post_index, 'EngagementScore'] >= min_engagement:
            tags = df.loc[post_index, 'HashtagList']
            recommended_tags.extend(tags)  # Add all tags from this post
            if len(recommended_tags) >= top_n:
                break

    # Remove duplicates while preserving order, then keep at most top_n hashtags
    seen = set() # Set to track tags that have already been seen
    unique_tags = [] # List to store hashtags without duplicates
    for tag in recommended_tags:
        if tag not in seen:
            seen.add(tag)
            unique_tags.append(tag)
    recommended_tags = unique_tags[:top_n]

    # Step 3: Score the recommended hashtags
    for rank, tag in enumerate(recommended_tags, start = 1):
        # Find the similar posts that contributed to this hashtag and passed the engagement filter
        contributing_posts = []
        for post_index, score in similar_posts:
            has_tag = tag in df.loc[post_index, 'HashtagList']
            good_engagement = df.loc[post_index, 'EngagementScore'] >= min_engagement
            if has_tag and good_engagement:
                contributing_posts.append(post_index)

        # If at least one post contributed this tag, take the scores from the most similar one
        if contributing_posts:
            source_post = contributing_posts[0]
            similarity_score = None
            for idx_, score in similar_posts:
                if idx_ == source_post:
                    similarity_score = score
                    break
            engagement_score = df.loc[source_post, 'EngagementScore']
        else:
            similarity_score = None
            engagement_score = None

        # Add the result for this hashtag
        all_outputs.append({
            'QueryPostIndex': idx,
            'QueryCaption': caption_text,
            'RecommendedHashtag': tag,
            'Rank': rank,
            'SimilarityScore': similarity_score,
            'EngagementScore': engagement_score
        })

# Step 4: Save the results
final_df = pd.DataFrame(all_outputs)
final_df.to_csv("../data/recommended_hashtags_multiple_posts.csv", index = False)


Query caption [0]: Here are some of the most important data visualizations that every Financial Data Analyst/Scientist should know.

Query caption [1]: Here are some of the best data science project ideas on healthcare. If you want to become a data science professional in the healthcare domain then you must try to work on these projects.

Query caption [2]: Learn how to train a machine learning model and giving inputs to your trained model to make predictions using Python.

Query caption [3]: Here’s how you can write a Python program to detect whether a sentence is a question or not. The idea here is to find the words that we see in the beginning of a question in the beginning of a sentence.

Query caption [4]: Plotting annotations while visualizing your data is considered good practice to make the graphs self-explanatory. Here is an example of how you can annotate a graph using Python.

Query caption [5]: Here are some of the most important soft skills that every data scientist shoul