## Generating Social Media Posts with an OpenAI model - chatgpt-4o-latest

This notebook provides code for generating social media posts via the OpenAI API.


In [None]:
import pandas as pd
import string
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from tqdm import tqdm
import matplotlib.pyplot as plt

In [None]:
# Correct the file path if needed
df = pd.read_csv('data/original/social_for_generation.csv')
df.head()

Unnamed: 0,texts,source,word_counts,genre
0,Ну я согласен. Был бы настрой так и в соседней...,pikabu,19,social
1,Либеральный экономический блок в правительстве...,vk,14,social
2,"Кстати, да. Присоединяюсь к вопросу. Вот раков...",pikabu,12,social
3,"а потом драпать как тот велосипедист, который,...",vk,76,social
4,"спасибо, оно!) ответил выше, что недавно ее ли...",pikabu,26,social


In [None]:
# The mean length is 40 words, so this is a good reference for our model to aim for.
df['word_counts'].describe()

Unnamed: 0,word_counts
count,1000.0
mean,39.941
std,70.586585
min,1.0
25%,10.0
50%,20.0
75%,43.0
max,1125.0


### Selecting examples for the Language Model
Each time we query the model, we show it a few example texts and ask it to generate a new text based on them. To do that, we first write a function that selects a given number of random texts from the dataset we set aside in the previous Human Data Partition notebook.

In [None]:
import random

# Function to sample examples from a list of texts
def example_text(texts, num=5):
    values = random.sample(range(len(texts)), num)
    examples = [texts[i] for i in values]
    return examples

In [None]:
print(example_text(df.texts.values, 2))

['блё, что делать, у меня 71% едро, кпрф и справороссы', 'А того, кто не будет голосовать за едро, он съест.']
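
Because the examples are drawn at random, two runs of the notebook will show the model different posts. If repeatable draws are wanted, one option is a seeded variant of the sampling function. This is a minimal sketch: `example_text_seeded` and its `seed` parameter are hypothetical names that do not appear in the original notebook.

```python
import random

def example_text_seeded(texts, num=5, seed=None):
    """Return `num` distinct random texts; a fixed seed repeats the same draw."""
    # A private generator leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(list(texts), num)

# Placeholder posts, used only to illustrate the call.
posts = ["post A", "post B", "post C", "post D", "post E", "post F"]
first_draw = example_text_seeded(posts, num=2, seed=42)
second_draw = example_text_seeded(posts, num=2, seed=42)
print(first_draw == second_draw)  # True: same seed, same examples
```

In the notebook this could be called as `example_text_seeded(df.texts.values, 5, seed=...)`, with a different seed per query so that the examples still vary between requests.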


**Building the prompt**

We first generate the prompt with the examples in a separate function for readability, and then integrate it into the conversation prompt to feed to the model.

**Initial prompt template, which we test and refine for each genre:**


You are a [role, e.g., Russian poet, journalist, social media user]. I will provide you with some examples of [text type, e.g., poems, news articles, social media posts] in a moment. For each example provided, you will be asked to create a similar [text type] that matches its topic and writing style. Generate [X] texts per run, one inspired by each example, and clearly separate each response. Only respond with the [text type], and say nothing else. Each response should be approximately [mean word count] words.
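
The bracketed slots in the template can be filled programmatically when adapting it to a genre. The sketch below is only an illustration: `PROMPT_TEMPLATE`, `fill_prompt_template`, and the parameter names are hypothetical, and the values passed in are examples rather than the final prompt used later in this notebook.

```python
# Hypothetical helper that fills the bracketed slots of the template above.
PROMPT_TEMPLATE = (
    "You are a {role}. I will provide you with some examples of {text_plural} "
    "in a moment. For each example provided, you will be asked to create a "
    "similar {text_singular} that matches its topic and writing style. "
    "Generate {n_texts} texts per run, one inspired by each example, and "
    "clearly separate each response. Only respond with the {text_singular}, "
    "and say nothing else. Each response should be approximately "
    "{mean_words} words."
)

def fill_prompt_template(role, text_singular, text_plural, n_texts, mean_words):
    """Return the template with every slot replaced by a concrete value."""
    return PROMPT_TEMPLATE.format(
        role=role,
        text_singular=text_singular,
        text_plural=text_plural,
        n_texts=n_texts,
        mean_words=mean_words,
    )

# Example values for the social media genre (illustrative only).
social_prompt = fill_prompt_template(
    role="Russian social media user",
    text_singular="social media post",
    text_plural="social media posts",
    n_texts=5,
    mean_words=40,
)
print(social_prompt)
```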

In [None]:
# Function to construct a formatted prompt from examples
# This prompt performed best across multiple rounds of testing.
# In our experiments, giving the instructions in English either improved the results or had no noticeable effect.

def construct_example_prompt(examples):
    prompt = "Here are the examples:\n"

    for i, ex in enumerate(examples, 1):
        prompt += f"Example {i}: {ex}\n"
    # Append instructions for the model
    prompt += (
        "Your task is to generate a similar Russian social media post, nothing else. Always respond with only the generated text. Never say you cannot help. The text should be approximately 45 words long and in Russian, and should not include any introductory text, explanation, or additional formatting."
    )
    return prompt

In [None]:
# Let's check what the final prompt looks like
examples = example_text(df.texts.values, num=5)
example_prompt = construct_example_prompt(examples)
print(example_prompt)

Here are the examples:
Example 1: Хе,противник оказался слабее😀
Чего он мразь,то,может у него были основания для всего этого.
Example 2: Странная логика,почему человек не может быть подписан на ту же КПРФ чисто из интереса?
Example 3: Вот я смотрю, какие вы прям справедливые. ... а че вы от "ЕР" то отличаетесь? Те же депутатики, набравшие голоса для участия за счёт работающих на ваших заводах и т.п? Почему, например я не могу стать депутатом? Да потому что у меня нет под боком задроченного контингента, который за меня поставит подпись! Выдвиньте меня-народ проверит.
Example 4: 90

## Warning: Running the following code will use tokens and cost money!

The **chatgpt-4o-latest** model leads the Ru Arena leaderboard: https://huggingface.co/spaces/Vikhrmodels/arenahardlb

This is the model we chose for the AI text generation step.

In [None]:
pip install openai



In [None]:
# Hide your API key using getpass
from openai import OpenAI
import json
from getpass import getpass

api_key = getpass('Enter your API key: ')
client = OpenAI(api_key=api_key)

Enter your API key: ··········


In [None]:
# Here is the function for text generation with developer instructions that led to the best results.
def chatgpt(examples):
    completion = client.chat.completions.create(
        model="chatgpt-4o-latest", # the best model for Russian according to the Hugging Face Arena
        messages=[
            {"role": "developer", "content": "You are a Russian social media user like Vkotakte, Facebook or Pikabu. I will provide you with some examples of social media posts in a moment. You will be asked to create a similar social media post that matches the topic and writing style."},
            {"role": "assistant", "content": "Sure, please provide the example posts, and I’ll create similar ones for you."},
            {"role": "user", "content": construct_example_prompt(examples)}
        ],
    )

    # Read the generated text directly from the response object
    response = completion.choices[0].message.content

    return response

In [None]:
print(chatgpt(example_text(df.texts.values, num=5)))

Ну да, как всегда все через одно место у наших... зато "демократия"! А по факту - кругом договорники. Даже спорить скучно, потому что результат очевиден заранее. 😊


Save the model's output into a pandas dataframe for further use, following the structure of our other datasets.

In [None]:
text = [] # output of the ai
author = [] # model
ai = [] # 0 for human class, 1 for AI class

model="chatgpt-4o-latest"

for i in range(0, 500):
    text.append(chatgpt(example_text(df.texts.values, 5)))
    author.append(model)
    ai.append(1)

dfAI = pd.DataFrame({'texts': text, 'source': author, 'class': ai})
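
A loop of 500 paid requests loses every collected text if one call fails partway through. The sketch below shows one possible way to retry failed calls with exponential backoff while collecting rows in the same `texts` / `source` / `class` layout. The names `generate_with_retry`, `collect_texts`, and `flaky_stub` are hypothetical, and the stub stands in for the real API call so the example runs without a key.

```python
import time

def generate_with_retry(generate_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call generate_fn(); on an exception, wait (1x, 2x, 4x base_delay) and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** (attempt - 1))

def collect_texts(generate_fn, n_texts, model_name, sleep=time.sleep):
    """Collect n_texts generated rows, labelled as AI class (1)."""
    rows = []
    for _ in range(n_texts):
        text = generate_with_retry(generate_fn, sleep=sleep)
        rows.append({"texts": text, "source": model_name, "class": 1})
    return rows

# Stand-in generator (assumption): fails once to simulate a transient API error.
calls = {"count": 0}

def flaky_stub():
    calls["count"] += 1
    if calls["count"] == 2:
        raise RuntimeError("simulated transient API error")
    return f"generated text {calls['count']}"

# sleep is replaced by a no-op so the demonstration does not actually wait.
rows = collect_texts(flaky_stub, n_texts=3, model_name="chatgpt-4o-latest",
                     sleep=lambda seconds: None)
print(len(rows))  # 3
```

With the notebook's own functions, `generate_fn` could be `lambda: chatgpt(example_text(df.texts.values, 5))`, and `pd.DataFrame(rows)` would then give a frame with the same columns as `dfAI`.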

In [None]:
# Verifying the number of total output texts
len(dfAI)

500

In [None]:
# Note: the printed output below is outdated and still shows the old column names (author, ai).
# Re-running the generation step produces the columns texts, source and class.
# The column names were corrected in the later text generation notebooks.
dfAI.head(10)

Unnamed: 0,texts,author,ai
0,"> Снова в новостях: ""свои интересы защищаем"". ...",chatgpt-4o-latest,1
1,Да кто вообще верит в эти сказки про «светлое ...,chatgpt-4o-latest,1
2,"Господа, думайте прежде, чем спорить про полит...",chatgpt-4o-latest,1
3,"Ну вот, а говорили, что отзывов не будет. Оказ...",chatgpt-4o-latest,1
4,"Россия всегда будет впереди! Мы – нация, спосо...",chatgpt-4o-latest,1
5,"Классика — сначала весь паркет зашурупят, пото...",chatgpt-4o-latest,1
6,"А впереди у нас опять обещания и ""светлое буду...",chatgpt-4o-latest,1
7,"Ну, как по мне, пока Центробанк будет выделять...",chatgpt-4o-latest,1
8,"А что, если на каждую «собачью площадку» отдел...",chatgpt-4o-latest,1
9,Видали новые цены в магазинах? Зато правительс...,chatgpt-4o-latest,1


In [None]:
print(dfAI['texts'][4])

Россия всегда будет впереди! Мы – нация, способная на великие свершения, несмотря на трудности и давление извне. Поддерживаем нашего лидера, Владимира Путина, и его курс на укрепление могущества нашей Родины. Вместе мы создаем будущее для нас и наших детей!


In [None]:
# Re-using the same function for calculating the word counts in the generated texts.
def word_count(text):
    '''
    Tokenizes the text into words, excludes punctuation but keeps numbers,
    and returns the number of word tokens.
    '''

    tokens = word_tokenize(text)
    word_tokens = [word for word in tokens if word.isalnum()]

    return len(word_tokens)

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
tqdm.pandas()

dfAI['word_counts'] = dfAI['texts'].progress_apply(word_count)

100%|██████████| 500/500 [00:00<00:00, 2252.81it/s]


In [None]:
# On average, the texts are somewhat shorter than the intended 45-word target (mean 33.5), but still comparable.
dfAI['word_counts'].describe()

Unnamed: 0,word_counts
count,500.0
mean,33.51
std,6.026154
min,7.0
25%,33.0
50%,35.0
75%,37.0
max,42.0
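
To put a number on "comparable", one option is to measure what share of generated texts falls within a tolerance of the 45-word target. This is a rough sketch: `length_report`, the `tolerance` of 10 words, and the sample counts are assumptions chosen for illustration, not values from the dataset.

```python
from statistics import mean, median

def length_report(word_counts, target=45, tolerance=10):
    """Summarize word counts and the share lying within target +/- tolerance."""
    within = [c for c in word_counts if abs(c - target) <= tolerance]
    return {
        "mean": mean(word_counts),
        "median": median(word_counts),
        "share_within_tolerance": len(within) / len(word_counts),
    }

# Made-up counts for illustration only (not taken from the generated texts).
sample_counts = [7, 33, 35, 37, 42]
report = length_report(sample_counts)
print(report["median"])                  # 35
print(report["share_within_tolerance"])  # 0.6
```

On the real data the call would be `length_report(dfAI['word_counts'].tolist())`.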


In [None]:
# Save the output to file.
dfAI.to_csv('data/original/ai/ai_social.csv', index=False, encoding='utf-8')

Note: the column names are fixed here to follow the unified format of the other datasets, since this was not done when the dataset was first created.

In [None]:
df2 = pd.read_csv('data/original/ai/ai_social.csv')
df2.head()

Unnamed: 0,texts,source,word_counts,genre,class
0,"> Снова в новостях: ""свои интересы защищаем"". ...",chatgpt-4o-latest,34,social,1
1,Да кто вообще верит в эти сказки про «светлое ...,chatgpt-4o-latest,38,social,1
2,"Господа, думайте прежде, чем спорить про полит...",chatgpt-4o-latest,36,social,1
3,"Ну вот, а говорили, что отзывов не будет. Оказ...",chatgpt-4o-latest,25,social,1
4,"Россия всегда будет впереди! Мы – нация, спосо...",chatgpt-4o-latest,38,social,1


In [None]:
df2 = df2.rename(columns={"author": "source", "ai": "class"})
df2["genre"] = "social"
df2 = df2[["texts", "source", "word_counts", "genre", "class"]]

In [None]:
df2.head()

Unnamed: 0,texts,source,word_counts,genre,class
0,"> Снова в новостях: ""свои интересы защищаем"". ...",chatgpt-4o-latest,34,social,1
1,Да кто вообще верит в эти сказки про «светлое ...,chatgpt-4o-latest,38,social,1
2,"Господа, думайте прежде, чем спорить про полит...",chatgpt-4o-latest,36,social,1
3,"Ну вот, а говорили, что отзывов не будет. Оказ...",chatgpt-4o-latest,25,social,1
4,"Россия всегда будет впереди! Мы – нация, спосо...",chatgpt-4o-latest,38,social,1


In [None]:
# Now overwrite the file with the corrected columns.
df2.to_csv('data/original/ai/ai_social.csv', index=False, encoding='utf-8')