Most of this code comes from the succinctlyai/midjourney-prompt-analysis Kaggle notebook. I downloaded it and kept only the parts needed for the guess-the-prompt app: building a list of prompts and the URLs of the images MidJourney generated for them. The list is saved to a JSON file that will be used to populate the app's DB.

<h1> Read and Categorize the Discord Messages </h1>

Each user request for image generation is encoded as a [Message](https://discord.com/developers/docs/resources/channel#message-object) object.
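
As a rough illustration, the rest of this notebook only reads three fields of each message: `content`, `components`, and `attachments`. The sketch below shows a hypothetical, heavily trimmed message; the prompt text, user id, and URL are invented for illustration, and real Message objects carry many more fields.

```python
# Hypothetical, trimmed-down message; all values are made up for illustration.
example_message = {
    # The prompt is wrapped in double stars inside the content string.
    "content": "**a lighthouse at dawn, oil painting** - <@1234> (fast)",
    # Rows of UI buttons; the button labels are used to detect the message type.
    "components": [
        {"components": [{"label": "U1"}, {"label": "U2"}, {"label": "V1"}]},
    ],
    # The generated image is delivered as a single attachment.
    "attachments": [{"url": "https://example.com/generated.png"}],
}

# Flatten the button labels across all rows.
labels = [
    component["label"]
    for row in example_message["components"]
    for component in row["components"]
]
print(labels)
print(example_message["attachments"][0]["url"])
```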

In [2]:
import json
import os

filepaths = []
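for dirname, _, filenames in os.walk('./data'):
    for filename in filenames:
        # Skip images.json, which this notebook writes into ./data at the end;
        # otherwise a re-run would try to parse it as a Discord export.
        if filename == 'images.json':
            continue
        filepaths.append(os.path.join(dirname, filename))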
print(f"Found {len(filepaths)} files.")

Found 272 files.


As illustrated in the dataset description, users interact with the MidJourney bot in three ways:

1. Issue a new text prompt (with an optional image prompt) to request image generation.
2. Request variations of a previously generated image.
3. Request upscaling a previously generated image.

The majority of messages reflect one of these three intents. Here we ignore all other messages (e.g. text messages simply saying hello).

In [3]:
from collections import defaultdict
import json

# Detect the message type based on the UI components shown to the user.
# See https://discord.com/developers/docs/interactions/message-components#what-is-a-component
COMPONENTS_FOR_INITIAL_AND_VARIATION = {
    'U1', 'U2', 'U3', 'U4', '⟳', 'V1', 'V2', 'V3', 'V4'}
COMPONENTS_FOR_UPSCALE = {
    'Make Variations', 'Upscale to Max', 'Light Upscale Redo'}


def get_message_type(message):
  """Figures out the message type based on the UI components displayed."""
  for components in message["components"]:
    for component in components["components"]:
      if component["label"] in COMPONENTS_FOR_INITIAL_AND_VARIATION:
        # For (very few) messages that are supposedly initial or variation requests, the content indicates
        # that they are actually upscale requests. We will just put these aside.
        if "Upscaled" in message["content"]:
          return "INCONCLUSIVE"
        return "INITIAL_OR_VARIATION"
      elif component["label"] in COMPONENTS_FOR_UPSCALE:
        return "UPSCALE"
  return "TEXT_MESSAGE"

messages_by_type = defaultdict(list)
for filepath in filepaths:
  with open(filepath, "r") as f:
    content = json.load(f)
    for single_message_list in content["messages"]:
      assert len(single_message_list) == 1
      message = single_message_list[0]
      message_type = get_message_type(message)
      messages_by_type[message_type].append(message)
        
print("Message counts:")
for mtype, messages in messages_by_type.items():
  print("\t", mtype, len(messages))

Message counts:
	 UPSCALE 102249
	 INITIAL_OR_VARIATION 145822
	 TEXT_MESSAGE 20309
	 INCONCLUSIVE 43
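
To make the four categories concrete, here is a small self-contained sketch that classifies a few hypothetical messages. `classify` and `make_message` are illustrative names, not part of the original notebook; `classify` follows the same rules as `get_message_type` but uses `.get()` so that messages without `components` or buttons without a `label` do not raise. The sample contents are invented.

```python
INITIAL_OR_VARIATION_LABELS = {'U1', 'U2', 'U3', 'U4', '⟳', 'V1', 'V2', 'V3', 'V4'}
UPSCALE_LABELS = {'Make Variations', 'Upscale to Max', 'Light Upscale Redo'}


def classify(message):
    """Defensive variant of get_message_type (hypothetical helper)."""
    for row in message.get("components", []):
        for component in row.get("components", []):
            label = component.get("label")  # None if the button has no label
            if label in INITIAL_OR_VARIATION_LABELS:
                # Buttons say "initial/variation" but the text says "Upscaled".
                if "Upscaled" in message.get("content", ""):
                    return "INCONCLUSIVE"
                return "INITIAL_OR_VARIATION"
            if label in UPSCALE_LABELS:
                return "UPSCALE"
    return "TEXT_MESSAGE"


def make_message(content, labels):
    """Builds a minimal message with one row of labelled buttons."""
    return {
        "content": content,
        "components": [{"components": [{"label": label} for label in labels]}],
    }


samples = [
    make_message("**a red fox** - <@1>", ["U1", "U2", "V1"]),
    make_message("**a red fox** - Upscaled by <@1>", ["Make Variations"]),
    make_message("**a red fox** - Upscaled by <@1>", ["U1"]),
    {"content": "hello everyone"},  # plain chat message, no buttons at all
]
results = [classify(m) for m in samples]
print(results)
```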


<h1> Explore User-Generated Prompts </h1>

In [4]:
import re

def get_prompt(message):
    """Extracts the prompt from the message content, which is located between double stars."""
    content = message["content"]
    # Replace newlines with spaces; makes the regex below work.
    content = content.replace("\n", " ")
    # Find the text enclosed by two consecutive stars.
    BETWEEN_STARS = r"\*\*(.*?)\*\*"
    match = re.search(BETWEEN_STARS, content)
    if match:
        return match.group(1)  # The text between the stars.
    return None  # No prompt found (malformed message).
    

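def remove_urls(prompt):
    """Prompts can include both text and images; this function removes the prompt image URLs."""
    # Note: `[^<]*` is greedy, so a match can run past the URL itself, up to the
    # last whitespace before the next "<". In this notebook the result is only
    # compared with the original prompt to detect prompts that contain an image URL.
    URL = r"<https[^<]*>?\s"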
    matches = re.findall(URL, prompt)
    for match in matches:
        prompt = prompt.replace(match, "")
    return prompt
    

def get_generated_image_url(message):
    """Extracts the URL of the generated image from the message."""
    attachments = message["attachments"]
    if len(attachments) == 1:
        return attachments[0]["url"]
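
As a quick sanity check, the sketch below runs the same two regexes on hypothetical message contents. The helpers are re-declared so the snippet runs on its own; the sample strings and the `has_image_prompt` name are assumptions made for illustration.

```python
import re


def get_prompt(message):
    """Returns the text between the first pair of double stars, or None."""
    content = message["content"].replace("\n", " ")
    match = re.search(r"\*\*(.*?)\*\*", content)
    return match.group(1) if match else None


def remove_urls(prompt):
    """Same logic as above: strip every match of the image-URL pattern."""
    for match in re.findall(r"<https[^<]*>?\s", prompt):
        prompt = prompt.replace(match, "")
    return prompt


def has_image_prompt(prompt):
    """Hypothetical helper: True if remove_urls changes the prompt."""
    return remove_urls(prompt) != prompt


text_only = {"content": "**a castle in the clouds, watercolor** - <@1234> (fast)"}
with_image = {"content": "**<https://example.com/ref.png> a castle in the clouds** - <@1234> (fast)"}
no_stars = {"content": "hello everyone"}

print(get_prompt(text_only))
print(has_image_prompt(get_prompt(text_only)))
print(has_image_prompt(get_prompt(with_image)))
print(get_prompt(no_stars))
```

Because the URL pattern is greedy, the stripped string may lose more than the URL itself, so the result of `remove_urls` is best treated as a yes/no signal, which is how the rest of this notebook uses it.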

Going forward, we will focus on INITIAL_OR_VARIATION prompts, since the UPSCALE prompts are a subset of the former (an UPSCALE request is associated with the same prompt used to generate the image that is currently being upscaled).

In [15]:
user_requests = []
for m in messages_by_type["INITIAL_OR_VARIATION"]:
    prompt = get_prompt(m)
    generated_url = get_generated_image_url(m)
    # In *very* rare cases, messages are malformed and these fields cannot be extracted.
    if prompt and generated_url:
        user_requests.append({'prompt': prompt, 'generated_url': generated_url})
        
num_messages = len(messages_by_type["INITIAL_OR_VARIATION"])
print(f"Parsed {len(user_requests)} user requests from {num_messages} messages.")

Parsed 145080 user requests from 145822 messages.


Let's see how many of these requests include an image in the prompt.

In [16]:
total = len(user_requests)
with_url = sum(remove_urls(r['prompt']) != r['prompt'] for r in user_requests)
print(f"{with_url} out of {total} INITIAL_OR_VARIATION prompts include an image")

70694 out of 145080 INITIAL_OR_VARIATION prompts include an image


Let's drop the entries whose prompt contains an image URL, keeping only text-only prompts:

In [18]:
user_requests = [r for r in user_requests if remove_urls(r['prompt']) == r['prompt']]
print(f"{len(user_requests)} INITIAL_OR_VARIATION prompts do not include an image")

74386 INITIAL_OR_VARIATION prompts do not include an image


Let's save the list as JSON so we can later load it into our DB.

In [20]:
with open("./data/images.json", "w") as f:
    json.dump(user_requests, f)
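
Before loading the file into the DB, it may be worth checking that it round-trips cleanly. The sketch below writes a couple of hypothetical entries with the same shape as `user_requests` to a temporary file, reads them back, and picks one at random, roughly as a guessing round might. The sample entries and the `challenge` name are assumptions; this is not the app's actual loading code.

```python
import json
import os
import random
import tempfile

# Hypothetical entries with the same shape as the saved list.
sample_requests = [
    {"prompt": "a lighthouse at dawn, oil painting", "generated_url": "https://example.com/1.png"},
    {"prompt": "a robot reading a book", "generated_url": "https://example.com/2.png"},
]

# Write to a temporary directory so the real ./data/images.json is untouched.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "images.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_requests, f)
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

# Every entry should carry both fields the app needs.
assert all({"prompt", "generated_url"} <= entry.keys() for entry in loaded)

# Pick one entry at random: show the image, keep the prompt as the answer.
challenge = random.choice(loaded)
print(challenge["generated_url"])
```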