# Social Media Dataset Labeling

With our venue data saved in the graph and vectorstore index, and our labels narrowed down to a set of 8 (listed in the `../data/personas.json` file), we can now load, trim, and label our [Instagram Images with Captions](https://www.kaggle.com/datasets/prithvijaunjale/instagram-images-with-captions) dataset.

In [7]:
# Random state for trimming rows. DO NOT CHANGE, as this will affect the values of the dataset
RANDOM_STATE = 100
DATASET_SIZE = 300

### Data Set Trimming
First, we must clean and trim our dataset down to a suitable number of examples. We start by filtering rows with a basic word-count threshold, and then randomly select 300 rows so that the sample preserves the distribution of word count and content in the dataset.

In [125]:
import pandas as pd

CAPTIONS_DROP_COLUMNS = ["Sr No", "Image File"]

df = pd.read_csv("../data/social/captions_full_kaggle.csv")

# Drop unused columns and rename columns for consistency
df.drop(columns=CAPTIONS_DROP_COLUMNS, inplace=True)
df.dropna(subset=['Caption'], inplace=True)
df.rename(columns={"Caption": "caption"}, inplace=True)

# Add a word count column
df['word_count'] = df['caption'].str.split().str.len()

# Filter out all rows with fewer than 12 words
df = df[df['word_count'] >= 12]
df.describe()

       word_count
count      3454.0
mean    30.450782
std     35.244671
min          12.0
25%          14.0
50%          20.0
75%          32.0
max         402.0


In [126]:
# Select a random sample of 300 rows, which we will use as our labeled dataset.
# The idea is to use the following splits: 
# - 200 for training
# - 50 for validation
# - 50 for testing
df = df.sample(n=DATASET_SIZE, random_state=RANDOM_STATE)
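
The 200/50/50 split mentioned in the comments above is not carried out in this notebook. The following is a minimal sketch of one way it could be done, assuming a single seeded shuffle followed by slicing. The helper name `split_dataset` and the toy DataFrame are hypothetical and only stand in for the trimmed captions.

```python
import pandas as pd


def split_dataset(frame: pd.DataFrame, n_train: int = 200, n_val: int = 50,
                  n_test: int = 50, seed: int = 100):
    """Shuffle the rows once, then slice them into three disjoint splits."""
    if n_train + n_val + n_test > len(frame):
        raise ValueError("split sizes exceed the number of rows")
    shuffled = frame.sample(frac=1, random_state=seed).reset_index(drop=True)
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:n_train + n_val + n_test]
    return train, val, test


# Toy stand-in for the 300 trimmed captions (hypothetical data).
toy_df = pd.DataFrame({"caption": [f"caption {i}" for i in range(300)]})
train_df, val_df, test_df = split_dataset(toy_df)
print(len(train_df), len(val_df), len(test_df))
```

Using a fixed seed keeps the splits reproducible, in the same spirit as `RANDOM_STATE` for the row sampling.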

In [127]:
df.to_csv("../data/social/captions_trimmed_kaggle.csv", index=False)

In [99]:
sample_captions = df['caption'][50:60].tolist()
sample_captions

["JO JO you're insane. Thank you from the bottom of my heart. I walked into my home to this! ",
 "Happy Birthday to the funniest, most down to earth guy I know!!! I can't wait to celebrate with you next week!!!! I love you so much !!!! ",
 '20 minutes until the album. Tomorrow at noon local the CHANGES TOUR goes on sale. See you out there. Thanks ',
 'Plastic doll vibes ;) pulling my face back with tape all day was much worth it for this amazing art cover of with & so dope!!!! ',
 'Something I wish I knew about when I was 18 and voting for the first time: ✨EARLY VOTING✨. It makes it so quick and easy to go and cast your vote before November 6. Early voting starts TODAY in Tennessee and goes to Nov 1 🇺🇸 You can check out your state’s early voting dates at the link in my bio ',
 'For over 11 years, Mario and I have worked together and created some incredible glam moments. I’ve watched him grow from a talented young artist who slowly and respectfully worked his way up through the years in

### Dataset Labeling

To label the dataset, we will start with an OpenAI chat model (the labeling code in this section calls `gpt-4-1106-preview`). By carefully formatting a prompt, we can get a base set of labels for our 300 captions. We then store these results in a CSV file, so that the labels can be manually edited in Excel, or another spreadsheet editor, to ensure high-quality results.

In [13]:
SYSTEM_PROMPT = """
You are a social media data consultant specializing in analyzing trends across various social 
media platforms. Currently, you are working with a client to identify trends in Instagram 
captions and understand the interests of followers.

Your primary task involves analyzing Instagram captions and identifying the most likely 
"personas" of the followers of the individual or business that posted each caption. You 
will use a predefined list of 8 personas to categorize the captions.

For each caption provided to you, your job is to select any personas that accurately depict 
the followers of the individual or business who posted the caption. Remember, you must select
at least 1 but no more than 3 personas for each caption. You will receive captions in batches of 20.

Below is the detailed list of the 8 personas you will use for categorization:

1. **The Social Butterfly**: A vibrant and outgoing individual who thrives in the energy of social gatherings, frequently found enjoying the nightlife at lively bars and clubs, and always up for a celebration with friends.

2. **The Culinary Explorer**: A gourmet aficionado who revels in culinary adventures, exploring diverse cuisines at fine dining establishments, and sharing their love for unique and delicious food experiences.

3. **The Beauty and Fashion Aficionado**: A trendsetter passionate about the latest in fashion and beauty, often seen at stylish shopping venues and beauty product launches, and always keeping up with the newest trends.

4. **The Family-Oriented Individual**: A person who cherishes family time and creates memories with loved ones, often participating in family-friendly activities, visiting parks, and enjoying experiences that cater to all ages.

5. **The Art and Culture Enthusiast**: A lover of the arts and culture, often found absorbing the rich experiences offered by museums and galleries, and always seeking to expand their horizons through artistic and cultural exploration.

6. **The Wellness and Self-Care Advocate**: A seeker of tranquility and personal well-being, often indulging in self-care routines, visiting wellness retreats and spas, and embracing serene natural environments for relaxation.

7. **The Adventurer and Explorer**: An intrepid soul with a thirst for adventure, often embarking on exciting journeys, exploring the great outdoors, and engaging in activities that offer a rush of adrenaline and connection with nature.

8. **The Eco-Conscious Consumer**: A dedicated advocate for sustainability and eco-friendly living, preferring to shop at environmentally conscious stores, visit farmers' markets, and support initiatives that align with their green lifestyle.

Your responses should be formatted into a JSON object according to the following schema:
```
{
    "type": "object",
    "description": "A JSON object containing the personas selected for each caption",
    "properties": {
        "captions": {
            "type": "array",
            "items": {
                "type": "object",
                "description": "A JSON object containing a caption and the personas selected for that caption",
                "properties": {
                    "caption": {
                        "type": "string",
                        "description": "The caption that was provided to you"
                    },
                    "personas": {
                        "type": "array",
                        "description": "The personas that you selected for the caption",
                        "minItems": 1,
                        "maxItems": 3,
                        "items": {
                            "type": "string",
                            "enum": [
                                "socialButterfly",
                                "culinaryExplorer",
                                "beautyFashionAficionado",
                                "familyOrientedIndividual",
                                "artCultureEnthusiast",
                                "wellnessSelfCareAdvocate",
                                "adventurerExplorer",
                                "ecoConsciousConsumer"
                            ]
                        }
                    }
                },
                "required": ["caption", "personas"]
            }
        }
    }
}
```
The JSON response you provide should be described by the above schema. Note that this is the schema describing the JSON object,
not the actual JSON object itself.
"""

In [32]:
from pprint import pprint

example_output = {
    "captions": [
        {
            "caption": "Rainy days, cozy blanket, and finally getting to the last chapter of this incredible book",
            "personas": ["artCultureEnthusiast"]
        },
        {
            "caption": "A spontaneous road trip led us to this breathtaking view. Can't get enough of these mountain vibes! 🏔️ #Wanderlust",
            "personas": ["adventurerExplorer"]
        },
        {
            "caption": "Had the best organic avocado toast at this new cafe downtown. Clean eating never tasted so good!",
            "personas": ["culinaryExplorer", "ecoConsciousConsumer"]
        },
        {
            "caption": "Late night baking session with the kids - our kitchen is a mess but our hearts are full!",
            "personas": ["familyOrientedIndividual"]
        },
        {
            "caption": "Throwback to that epic party last weekend. Can we do it all over again? 🎉",
            "personas": ["socialButterfly"]
        },
        {
            "caption": "You guys have to checkout Potted Studios on main street -- the cutest little hand crafted pottery shop!",
            "personas": ["artCultureEnthusiast", "ecoConsciousConsumer"]
        },
        {
            "caption": "Morning yoga in the garden, the best start to my day",
            "personas": ["wellnessSelfCareAdvocate", "ecoConsciousConsumer"]
        },
        {
            "caption": "Obsessing over my latest thrift shop finds – vintage fashion is an art!",
            "personas": ["beautyFashionAficionado", "artCultureEnthusiast",]
        },
        {
            "caption": "Exploring the farmer's market and making new friends. Love supporting local!",
            "personas": ["ecoConsciousConsumer", "socialButterfly"]
        },
        {
            "caption": "That moment when you realize you forgot the sunscreen in the car...",
            "personas": ["adventurerExplorer"]
        },
        {
            "caption": "Experimenting with a new recipe tonight. Plant based has been going well so far, update vlog for you guys tomorrow!",
            "personas": ["culinaryExplorer", "ecoConsciousConsumer"]
        },
        {
            "caption": "Beach walks are my therapy",
            "personas": ["wellnessSelfCareAdvocate", "adventurerExplorer"]
        },
        {
            "caption": " Still blown away with meeting my all time favorite author. Thank you @bookcon for making this happen!",
            "personas": ["artCultureEnthusiast"]
        },
        {
            "caption": "Saturdays are for spa treatments and self-pampering, need to get ready for a big night",
            "personas": ["wellnessSelfCareAdvocate", "socialButterfly"]
        },
        {
            "caption": "Just a casual Friday, exploring the local art scene.",
            "personas": ["artCultureEnthusiast"]
        },
        {
            "caption": "Found the cutest little bistro hidden away in the city. Their homemade pasta is to die for!",
            "personas": ["culinaryExplorer"]
        },
        {
            "caption": "Family game night is getting competitive -- but I wouldn't have it any other way!",
            "personas": ["familyOrientedIndividual"]
        },
        {
            "caption": "Taking fashion risks and loving every minute of it. Who says you can't mix patterns?",
            "personas": ["beautyFashionAficionado"]
        },
        {
            "caption": "Early morning hike with friends, followed by a well-deserved brunch",
            "personas": ["adventurerExplorer", "culinaryExplorer"]
        },
        {
            "caption": "Dreaming of fresh air and sunshine. Where should I go next?",
            "personas": ["adventurerExplorer"]
        }
    ]
}

example_captions = [row['caption'] for row in example_output['captions']]

personas_present = {}
for row in example_output['captions']:
    for persona in row['personas']:
        personas_present[persona] = personas_present.get(persona, 0) + 1

pprint(personas_present)

{'adventurerExplorer': 5,
 'artCultureEnthusiast': 5,
 'beautyFashionAficionado': 2,
 'culinaryExplorer': 4,
 'ecoConsciousConsumer': 5,
 'familyOrientedIndividual': 2,
 'socialButterfly': 3,
 'wellnessSelfCareAdvocate': 3}
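
The schema in `SYSTEM_PROMPT` limits each caption to between 1 and 3 personas drawn from the fixed list of 8, but nothing in the notebook checks that a labeled entry respects this. A minimal sketch of such a check in plain Python follows. The names `ALLOWED_PERSONAS` and `validate_entry` are hypothetical, and the two sample entries are made up for illustration.

```python
from typing import Any, Dict, List

ALLOWED_PERSONAS = {
    "socialButterfly", "culinaryExplorer", "beautyFashionAficionado",
    "familyOrientedIndividual", "artCultureEnthusiast",
    "wellnessSelfCareAdvocate", "adventurerExplorer", "ecoConsciousConsumer",
}


def validate_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one labeled caption (empty if valid)."""
    problems = []
    if not isinstance(entry.get("caption"), str):
        problems.append("missing or non-string 'caption'")
    labels = entry.get("personas")
    if not isinstance(labels, list):
        problems.append("missing or non-list 'personas'")
        return problems
    if not 1 <= len(labels) <= 3:
        problems.append(f"expected 1-3 personas, got {len(labels)}")
    unknown = [label for label in labels
               if not isinstance(label, str) or label not in ALLOWED_PERSONAS]
    if unknown:
        problems.append(f"unknown personas: {unknown}")
    return problems


good = {"caption": "Beach walks are my therapy", "personas": ["wellnessSelfCareAdvocate"]}
bad = {"caption": "No labels here"}
print(validate_entry(good))  # → []
print(validate_entry(bad))   # → ["missing or non-list 'personas'"]
```

Returning a list of problems, rather than raising on the first one, makes it easier to collect every issue in a batch before deciding what to re-label.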


In [33]:
import json
from typing import List, Union, Dict

from openai import OpenAI


def label_batch(batch: List[str]) -> str:
    """Label a batch of captions using the OpenAI API.

    Returns the raw JSON string produced by the model; the caller parses it.
    """
    client = OpenAI()
    messages = [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        # Few-shot example: example captions as the user input,
        # and the expected labels as the assistant's reply
        {'role': 'user', 'content': json.dumps({'captions': example_captions})},
        {'role': 'assistant', 'content': json.dumps(example_output)},
        # The batch to label, in the same format as the example input
        {'role': 'user', 'content': json.dumps({'captions': batch})},
    ]
    completion = client.chat.completions.create(
        model='gpt-4-1106-preview',
        messages=messages,
        response_format={'type': 'json_object'}
    )
    return completion.choices[0].message.content

In [34]:
from pprint import pprint
from tqdm import tqdm

captions = df.caption.tolist()
labeled_captions = []

for i in tqdm(range(0, len(captions), 20)):
    batch = captions[i:i+20]
    labeled_batch = label_batch(batch)
    labeled_batch_object = json.loads(labeled_batch)
    labeled_captions.extend(labeled_batch_object['captions'])

labeled_captions

100%|██████████| 15/15 [15:52<00:00, 63.52s/it]


[{'caption': 'Love this shoot I did with for my app & website ", '},
 {'caption': 'My aunty continues to make me feel so pretty :) love you!!! ',
  'personas': ['beautyFashionAficionado', 'familyOrientedIndividual']},
 {'caption': 'Selfies about to be LIT! \r\nToday on my app I talk about the perfect selfie lighting! The secret to my selfies Lumee.com ',
  'personas': ['beautyFashionAficionado']},
 {'caption': 'I did the most amazing make up tutorial for my app with none other than We did her classic "rock chick" look with all Charlotte Tilbury make up! You have to follow her & get inspired! check out our tutorial! ',
  'personas': ['beautyFashionAficionado']},
 {'caption': "TONIGHT AT 9/8c on NBC! Please don't hate me after you watch this \r\n",
  'personas': ['socialButterfly']},
 {'caption': 'Look what I found in my luggage while unpacking!!! Casa Aramara knows me too well! This just made my day! ',
  'personas': ['wellnessSelfCareAdvocate']},
 {'caption': "Appreciation post for the
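
The first entry in the output above has no `personas` key, and its caption text differs slightly from the input. A minimal sketch for listing the input captions that still need a label (for a retry or for manual labeling) is shown here. It assumes exact matching on whitespace-stripped caption text, so captions the model has altered are also reported. The helper name `find_unlabeled` and the toy data are hypothetical.

```python
from typing import Any, Dict, List


def find_unlabeled(captions: List[str], labeled: List[Dict[str, Any]]) -> List[str]:
    """Return input captions that have no entry with a non-empty 'personas' list."""
    labeled_texts = {
        entry["caption"].strip()
        for entry in labeled
        if isinstance(entry.get("caption"), str) and entry.get("personas")
    }
    return [caption for caption in captions if caption.strip() not in labeled_texts]


toy_captions = ["first caption ", "second caption", "third caption"]
toy_labeled = [
    {"caption": "first caption "},  # label missing
    {"caption": "second caption", "personas": ["socialButterfly"]},
    {"caption": "third caption (edited)", "personas": ["culinaryExplorer"]},  # text changed
]
print(find_unlabeled(toy_captions, toy_labeled))
# → ['first caption ', 'third caption']
```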

In [77]:

personas = ["socialButterfly", "culinaryExplorer", "beautyFashionAficionado", "familyOrientedIndividual", "artCultureEnthusiast", "wellnessSelfCareAdvocate", "adventurerExplorer", "ecoConsciousConsumer"]


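# Build one row per caption with a 0/1 column for each persona.
# Captions for which the model returned no 'personas' list get None in every column.
dataset = []
failed = 0
for row in labeled_captions:
    dataset_row = {'caption': row['caption']}
    row_personas = row.get('personas')
    if row_personas is None:
        failed += 1  # count each unlabeled caption once
    for persona in personas:
        dataset_row[persona] = None if row_personas is None else int(persona in row_personas)
    dataset.append(dataset_row)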


In [111]:

labeled_df = pd.DataFrame(dataset)
labeled_df.to_csv("../data/social/labeled_captions_kaggle.csv", index=False)

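# Estimate the count of each persona label as (column mean * 300 rows).
# This is exact only if every row was labeled: rows with missing labels are excluded from the mean.
labeled_df.describe().loc['mean'] * 300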

socialButterfly              87.0
culinaryExplorer              3.0
beautyFashionAficionado     119.0
familyOrientedIndividual     65.0
artCultureEnthusiast         38.0
wellnessSelfCareAdvocate     17.0
adventurerExplorer           23.0
ecoConsciousConsumer          7.0
Name: mean, dtype: float64

### Generated Dataset

After reviewing the class labels of the Kaggle dataset, we found severe imbalances in label frequencies. The dataset is heavily skewed toward fashion- and influencer-related categories: `beautyFashionAficionado` was assigned about 119 times, while `culinaryExplorer` was assigned only about 3 times. To remedy this, we had ChatGPT generate a new set of labeled captions, with the goal of creating a diverse dataset in which the class labels are distributed more evenly.

We used ChatGPT to avoid excessive API token usage. The results of the dataset creator can be found in `../data/social/labeled_captions_gpt.json`.
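
One simple way to put a number on the imbalance is the ratio between the most and least frequent label. The sketch below uses the Kaggle label counts reported earlier in this notebook. `imbalance_ratio` is a hypothetical helper name, and this ratio is only one of several possible measures.

```python
from typing import Dict

# Label counts copied from the Kaggle label summary earlier in this notebook.
kaggle_counts = {
    "socialButterfly": 87,
    "culinaryExplorer": 3,
    "beautyFashionAficionado": 119,
    "familyOrientedIndividual": 65,
    "artCultureEnthusiast": 38,
    "wellnessSelfCareAdvocate": 17,
    "adventurerExplorer": 23,
    "ecoConsciousConsumer": 7,
}


def imbalance_ratio(counts: Dict[str, int]) -> float:
    """Ratio of the most frequent label count to the least frequent one."""
    return max(counts.values()) / min(counts.values())


print(round(imbalance_ratio(kaggle_counts), 1))  # → 39.7
```

A ratio close to 1 indicates evenly distributed labels, and the same helper can be applied to the counts of the generated dataset for comparison.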


In [129]:
with open("../data/social/labeled_captions_gpt.json", "r") as f:
    labeled_captions = json.load(f)

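# Build one row per caption with a 0/1 column for each persona.
# Captions that have no 'personas' list get None in every column.
dataset = []
failed = 0
for row in labeled_captions:
    dataset_row = {'caption': row['caption']}
    row_personas = row.get('personas')
    if row_personas is None:
        failed += 1  # count each unlabeled caption once
    for persona in personas:
        dataset_row[persona] = None if row_personas is None else int(persona in row_personas)
    dataset.append(dataset_row)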

labeled_df = pd.DataFrame(dataset)
print(f"Total Caption Count: {labeled_df.shape[0]}")
labeled_df.describe().loc['mean'] * labeled_df.shape[0]

Total Caption Count: 300


socialButterfly             48.0
culinaryExplorer            45.0
beautyFashionAficionado     50.0
familyOrientedIndividual    59.0
artCultureEnthusiast        69.0
wellnessSelfCareAdvocate    49.0
adventurerExplorer          52.0
ecoConsciousConsumer        69.0
Name: mean, dtype: float64

In [130]:
labeled_df.to_csv("../data/social/labeled_captions_gpt.csv", index=False)