# Social Media Labeled Dataset

This notebook will cover the cleaning and labelling of our social media dataset. We are using the [Instagram Images with Captions](https://www.kaggle.com/datasets/prithvijaunjale/instagram-images-with-captions) dataset for this training task. We will start the process by first cleaning the dataset down to around 200 rows, with a good variety of content and length in our selections. This will ensure that we have a good sample space for training the model from. Once the dataset is cleaned (trimmed down to 200 rows) we will need to label each caption according to the keywords that we are using for the project. The labelling will be done initially by GPT-4-Turbo, and then we will manually review and edit the labels using an excel sheet before we use the dataset to fine-tune our Bert Classifier.

This notebook will cover: 
- **Data Cleaning**
- **GPT Based Automatic Labeling**

### Data Cleaning

To begin, we need to load our dataset into a dataframe and filter out all the rows that we will not be using. There are over 20k rows, so we will have plenty of captions to choose from. Our raw dataset is stored in the `../data/social/captions.csv` file.

In [1]:
# Random state for trimming rows. DO NOT CHANGE, as this will affect the values of the dataset
RANDOM_STATE = 100

In [2]:
import pandas as pd

df = pd.read_csv("../data/social/captions.csv")

# Drop unused columns and rename columns for consistency
df.drop(columns=["Sr No", "Image File"], inplace=True)
df.dropna(subset=['Caption'], inplace=True)
df.rename(columns={"Caption": "caption"}, inplace=True)

# Set a word count column
df['word_count'] = df['caption'].str.split().str.len()

# Filter out all rows with less than 10 words
df = df[df['word_count'] >= 12]
df.describe()

Unnamed: 0,word_count
count,3454.0
mean,30.450782
std,35.244671
min,12.0
25%,14.0
50%,20.0
75%,32.0
max,402.0


In [3]:
captions_list = df['caption'].tolist()

In [4]:
import json

with open("../data/venues/yelp.json", "r", encoding="utf-8") as f:
    venues = json.load(f)
    # Mash the content of the reviews into a single string for each location
    reviews_list = []
    for location in venues:
        review_content = ""
        for review in location["reviews"]:
            review_text = next(iter(review.values()))
            review_content += review_text + " "
        reviews_list.append(review_content)

We can see that we have extracted a set of 250 values from the dataset, while preserving the varied distribution of the caption length. Now, we will move on to the labelling process.

### Setting up a Topic Extraction Pipeline

Now, we want to extract relevant topics from each of our dataset seperately, and then view the combined results to settle on a good list of keywords.

**Keyword Extraction and Scoring**

We can begin by using TF-IDF NLP techniques, to extract keywords from each corpus. TD-IDF ("Term Frequency-Inverse Document Frequency) can be described as:

1. **Term Frequency (TF)**: This is a measure of how frequently a term appears in a document (in this case, a social media caption). It's calculated by dividing the number of times the term appears in the document by the total number of terms in the document. This reflects how important a term is within that specific document.

2. **Inverse Document Frequency (IDF)**: This measures the importance of the term across the entire corpus (your collection of social media captions). It's calculated by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient. This step reduces the weight of terms that appear very frequently across the corpus, as these terms are less informative (common terms like 'the', 'is', etc.).

3. **TF-IDF Score**: This is simply the product of TF and IDF. It's a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The intuition here is that a term is important for a document if it appears frequently in that document but not in many other documents. Thus, TF-IDF tends to filter out common terms that appear in many documents (like generic words) and highlight terms that are more unique to specific documents.

In [5]:
import re
import string
from typing import List

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')


def preprocess(caption: List[str]) -> List[str]:
    """Preprocesses the captions by removing special characters and converting to lowercase"""
    caption = re.sub(r'[^\w\s]', '', caption.lower())
    tokens = word_tokenize(caption)

    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]

    # Remove non-alphabetic tokens and stopwords
    stop_words = set(stopwords.words('english'))
    words = [word for word in stripped if word.isalpha() and word not in stop_words]
    return words

preprocessed_captions = [" ".join(preprocess(caption)) for caption in captions_list]
preprocessed_reviews = [" ".join(preprocess(review)) for review in reviews_list]

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jackmoffatt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jackmoffatt/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
from typing import Tuple

def score_and_sort(documents: List[str]) -> List[Tuple[str, float]]:
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(preprocessed_captions)

    # Get feature names and TF-IDF score of each word
    scores = zip(vectorizer.get_feature_names_out(), tfidf_matrix.sum(axis=0).tolist()[0])
    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    return sorted_scores

sorted_caption_scores = score_and_sort(preprocessed_captions)
sorted_review_scores = score_and_sort(preprocessed_reviews)

# Display top N Caption keywords
top_n = 10
print(f"Top {top_n} Caption Keywords:")
for feature, score in sorted_caption_scores[:top_n]:
    print(f"{feature}: {score}")

# Display top N Review keywords
print(f"\nTop {top_n} Review Keywords:")
for feature, score in sorted_review_scores[:top_n]:
    print(f"{feature}: {score}")


Top 10 Caption Keywords:
love: 96.71974191882008
thank: 69.25418802744699
im: 65.9716268084095
new: 64.68693526132691
happy: 56.001911154939414
birthday: 49.49981927072941
much: 45.18046579772852
see: 44.94033748513356
cant: 44.32868285776308
collection: 44.240309430604306

Top 10 Review Keywords:
love: 96.71974191882008
thank: 69.25418802744699
im: 65.9716268084095
new: 64.68693526132691
happy: 56.001911154939414
birthday: 49.49981927072941
much: 45.18046579772852
see: 44.94033748513356
cant: 44.32868285776308
collection: 44.240309430604306


**Topic Modeling Each Corpus**

Now that we have completed the TF-IDF process, we can move onto Topic Modeling. Topic modeling can uncover underlying themes or topics in large collections of text, such as your dataset of social media captions. One of the most popular methods for topic modeling is Latent Dirichlet Allocation (LDA).

In [7]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def get_topics(documents: List[str], n_components: int = 10, top_n: int = 10) -> List[str]:
    vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
    count_matrix = vectorizer.fit_transform(documents)

    lda = LatentDirichletAllocation(n_components=n_components, random_state=RANDOM_STATE)
    lda.fit(count_matrix)

    # Get feature names and TF-IDF score of each word
    feature_names = vectorizer.get_feature_names_out()
    topic_keywords = []
    for topic in lda.components_:
        topic_keywords.append([feature_names[i] for i in topic.argsort()[-top_n:]])

    return topic_keywords

num_topics = 15
top_n = 25
caption_topics = get_topics(preprocessed_captions, n_components=num_topics, top_n=top_n)
review_topics = get_topics(preprocessed_reviews, n_components=num_topics, top_n=top_n)


In [8]:
# Display top N Caption topics
print(f"Top {num_topics} Caption Corpus Topics:")
for i, topic in enumerate(caption_topics):
    print(f"Topic {i}: {topic}")

# Display top N Review topics
print(f"\nTop {num_topics} Venue Corpus Topics:")
for i, topic in enumerate(review_topics):
    print(f"Topic {i}: {topic}")

Top 15 Caption Corpus Topics:
Topic 0: ['people', 'guys', 'better', 'heart', 'going', 'lol', 'mom', 'dont', 'grateful', 'world', 'time', 'family', 'make', 'ive', 'years', 'today', 'thank', 'best', 'know', 'im', 'day', 'birthday', 'life', 'happy', 'love']
Topic 1: ['party', 'ready', 'friends', 'way', 'come', 'god', 'amazing', 'time', 'life', 'sweet', 'little', 'baby', 'guys', 'best', 'big', 'day', 'dont', 'like', 'im', 'girl', 'know', 'happy', 'birthday', 'love', 'thank']
Topic 2: ['use', 'time', 'heart', 'amazing', 'miami', 'bronze', 'available', 'want', 'end', 'greatest', 'concealer', 'mom', 'tell', 'shades', 'thank', 'new', 'like', 'kits', 'love', 'highlight', 'better', 'im', 'kkwbeautycom', 'powder', 'contour']
Topic 3: ['ive', 'true', 'team', 'styled', 'dream', 'amazing', 'believe', 'photo', 'night', 'day', 'came', 'like', 'loved', 'la', 'vogue', 'new', 'im', 'come', 'shot', 'make', 'cover', 'love', 'thank', 'hair', 'shoot']
Topic 4: ['best', 'saw', 'hope', 'isnt', 'world', 'let', 

### Results

With our topics properly modeled, we now can pass this list to Chat GPT for assistance in using it to create "Keywords" for relating Venues and Social Media posts. After discussing pros and cons with Chat GPT, I came to the decision that it will be best to use "Personas" instead of just keywords. As a result, ChatGPT was able to extract 8 common personas across the two corpa of text data. After many interations, this is the final Persona list we came up with:
```json
{
    "socialButterfly": "The Social Butterfly: A vibrant and outgoing individual who thrives in the energy of social gatherings, frequently found enjoying the nightlife at lively bars and clubs, and always up for a celebration with friends.",
    "culinaryExplorer": "The Culinary Explorer: A gourmet aficionado who revels in culinary adventures, exploring diverse cuisines at fine dining establishments, and sharing their love for unique and delicious food experiences.",
    "beautyFashionAficionado": "The Beauty and Fashion Aficionado: A trendsetter passionate about the latest in fashion and beauty, often seen at stylish shopping venues and beauty product launches, and always keeping up with the newest trends.",
    "familyOrientedIndividual": "The Family-Oriented Individual: A person who cherishes family time and creates memories with loved ones, often participating in family-friendly activities, visiting parks, and enjoying experiences that cater to all ages.",
    "artCultureEnthusiast": "The Art and Culture Enthusiast: A lover of the arts and culture, often found absorbing the rich experiences offered by museums and galleries, and always seeking to expand their horizons through artistic and cultural exploration.",
    "wellnessSelfCareAdvocate": "The Wellness and Self-Care Advocate: A seeker of tranquility and personal well-being, often indulging in self-care routines, visiting wellness retreats and spas, and embracing serene natural environments for relaxation.",
    "adventurerExplorer": "The Adventurer and Explorer: An intrepid soul with a thirst for adventure, often embarking on exciting journeys, exploring the great outdoors, and engaging in activities that offer a rush of adrenaline and connection with nature.",
    "ecoConsciousConsumer": "The Eco-Conscious Consumer: A dedicated advocate for sustainability and eco-friendly living, preferring to shop at environmentally conscious stores, visit farmers' markets, and support initiatives that align with their green lifestyle."
}
```
And with this, we have our 8 labels. These personas will be used as the 8 classification heads of the social media sentiment model, as well as the embedding terms to use for Venue's when creating Venue-Persona relationships. This data can be found in the `../data/personas.json` directory.

As a final step, we will create a pickled dataframe with our personas, so we can perform similarity searches.s

In [12]:
from openai import OpenAI

personas = {
    "socialButterfly": "The Social Butterfly: A vibrant and outgoing individual who thrives in the energy of social gatherings, frequently found enjoying the nightlife at lively bars and clubs, and always up for a celebration with friends.",
    "culinaryExplorer": "The Culinary Explorer: A gourmet aficionado who revels in culinary adventures, exploring diverse cuisines at fine dining establishments, and sharing their love for unique and delicious food experiences.",
    "beautyFashionAficionado": "The Beauty and Fashion Aficionado: A trendsetter passionate about the latest in fashion and beauty, often seen at stylish shopping venues and beauty product launches, and always keeping up with the newest trends.",
    "familyOrientedIndividual": "The Family-Oriented Individual: A person who cherishes family time and creates memories with loved ones, often participating in family-friendly activities, visiting parks, and enjoying experiences that cater to all ages.",
    "artCultureEnthusiast": "The Art and Culture Enthusiast: A lover of the arts and culture, often found absorbing the rich experiences offered by museums and galleries, and always seeking to expand their horizons through artistic and cultural exploration.",
    "wellnessSelfCareAdvocate": "The Wellness and Self-Care Advocate: A seeker of tranquility and personal well-being, often indulging in self-care routines, visiting wellness retreats and spas, and embracing serene natural environments for relaxation.",
    "adventurerExplorer": "The Adventurer and Explorer: An intrepid soul with a thirst for adventure, often embarking on exciting journeys, exploring the great outdoors, and engaging in activities that offer a rush of adrenaline and connection with nature.",
    "ecoConsciousConsumer": "The Eco-Conscious Consumer: A dedicated advocate for sustainability and eco-friendly living, preferring to shop at environmentally conscious stores, visit farmers' markets, and support initiatives that align with their green lifestyle."
}


data = [{'persona': persona, "description": description} for persona, description in personas.items()]
df = pd.DataFrame(data=data)

client = OpenAI()

def embed_terms(terms: List[str]) -> List[List[float]]:
    """Embeds the given terms using the OpenAI API"""
    response = client.embeddings.create(input=terms, model="text-embedding-ada-002")
    return [datum.embedding for datum in response.data]

df['embeddings'] = df['description'].apply(lambda x: embed_terms([x])[0])
df.to_pickle("../data/persona_dataframe.pkl")