[Task 1 - Propaganda Identification](#task-1---propaganda-identification)  
[Task 2 - Coarse propaganda characterisation](#task-2---coarse-propaganda-characterisation)  
[Task 3 - Fine-grained propaganda characterisation](#task-3---fine-grained-propaganda-characterisation)  

In [92]:
import os
from pathlib import Path

from dotenv import load_dotenv
import pandas as pd

from prompts import (
    prompt_t1_dipromats,
    prompt_t2_dipromats,
    prompt_t3_dipromats,
)

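# Load environment variables such as PROJECT_DIR from the .env file.
print("Loaded .env:", load_dotenv("../../.env", override=True))
data_dir = Path(os.environ["PROJECT_DIR"]) / "data" / "host" / "dipromats_2023"

# Split and language used to build the input and output file names below.
SPLIT = "val"
LANG = "en"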

Loaded .env: True


# Task 1 - Propaganda Identification

In [93]:
print(prompt_t1_dipromats)

You are an excellent assistant at identifying propaganda in tweets. Propaganda is defined as:
information, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.

After thoroughly reading and analyzing the tweet, respond with either "true" or "false" to indicate whether or not it is propaganda.

Tweet:




In [94]:
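import json
from datasets import Dataset

# Load the Task 1 records for the chosen split and language.
with open(data_dir / f"{SPLIT}_t1_{LANG}.json") as f:
    t1_en = json.load(f)

ds1 = Dataset.from_list(t1_en)

def format_t1(example):
    # Prepend the prompt to the tweet and copy the gold label into "response".
    return {
        "text": prompt_t1_dipromats + example["text"],
        "response": example["value"],
    }

ds1 = ds1.map(format_t1)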


Map:   0%|          | 0/1682 [00:00<?, ? examples/s]

In [95]:
ds1.shuffle()[0]

{'test_case': 'DIPROMATS2023',
 'id': '6263',
 'country': 'USA',
 'user_name': 'usaembassyinoz',
 'tweet_type': 'quoted',
 'tweet_id': 1297745865340145664,
 'UTC': '2020-08-24 04:01:39+00:00',
 'rts&fav': 267,
 'language': 'en',
 'text': 'You are an excellent assistant at identifying propaganda in tweets. Propaganda is defined as:\ninformation, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.\n\nAfter thoroughly reading and analyzing the tweet, respond with either "true" or "false" to indicate whether or not it is propaganda.\n\nTweet:\n\nAfter much community consultation, the U.S. Mission Australia will be adopting the ee-mew pronunciation in all future emu-related alliance matters #USwithAUS https://t.co/UgvtKOFoiD',
 'value': 'false',
 'response': 'false'}
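
The sample above shows that the original `text` column is overwritten by prompt plus tweet, while the other columns are kept. A minimal sketch of that behaviour for a single row, without the `datasets` dependency, might look like the following. `PROMPT` is a shortened stand-in for `prompt_t1_dipromats` and `record` is a made-up example, not a row from the dataset.

```python
# Shortened stand-in for prompt_t1_dipromats (assumption, not the real prompt).
PROMPT = 'Respond with either "true" or "false".\n\nTweet:\n\n'

def format_t1(example):
    # Same shape as the notebook's format_t1: prompt + tweet, label copied over.
    return {
        "text": PROMPT + example["text"],
        "response": example["value"],
    }

# Hypothetical record with the same keys the formatter reads.
record = {"id": "0", "text": "Example tweet text", "value": "false"}

# Dataset.map merges the returned keys into each row; a dict merge mimics
# that for one row, so "text" is replaced and "response" is added.
formatted = {**record, **format_t1(record)}

print(formatted["text"])
print(formatted["response"])
```

Because the returned dict reuses the key `text`, the raw tweet is no longer available as its own column after mapping.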

# Task 2 - Coarse propaganda characterisation

In [96]:
print(prompt_t2_dipromats)

You are an excellent assistant at categorizing propaganda in tweets. Propaganda is defined as:
information, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.

You will need to decide which of the following applies to the tweet. It could be one or more of the following.

1. Appeal to commonality. This could be related to the following:
  - Ad populum: the tweet appeals to the will, the tradition or the history of a community to support an argument. e.g. "The leadership of the #CPC is the choice of history and of the Chinese people."
  - Flag Waving: the tweet includes hyperbolic praise of a nation, worships a patriotic symbol, exhibits self-praise, or portrays someone as a hero. e.g. "The European Union is the best example, in the history of the world, of conflict resolution."
2. Discrediting the opponent. This could be related to the following:
  - Name Calling/Labelling: the author refers to someone or something wi

In [97]:
with open(data_dir / f"{SPLIT}_t2_{LANG}.json") as f:
    t2_en = json.load(f)

ds2 = Dataset.from_list(t2_en)

def format_t2(example):
    # Map the "false" label to a fifth category; other labels pass through.
    mapping = {"false": "5 not_propaganda"}
    labels = [mapping.get(x, x) for x in example["value"]]

    # One label per line, sorted so the target string is deterministic.
    return {
        "text": prompt_t2_dipromats + example["text"],
        "response": "\n".join(sorted(labels)),
    }

ds2 = ds2.map(format_t2)


Map:   0%|          | 0/1682 [00:00<?, ? examples/s]

In [98]:
import random

# Sample at random until we find a tweet with more than one label.
while True:
    x = random.choice(ds2)
    if len(x["value"]) > 1:
        break

print(x)
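
To make the Task 2 target format concrete, here is a small sketch of how the response string is assembled from a list of labels. The function name `build_t2_response` and the multi-label input strings are illustrative assumptions; only the `"false"` to `"5 not_propaganda"` mapping and the sorted newline join are taken from `format_t2`.

```python
def build_t2_response(values):
    # Same logic as format_t2: remap "false", then sort and join by newline.
    mapping = {"false": "5 not_propaganda"}
    labels = [mapping.get(v, v) for v in values]
    return "\n".join(sorted(labels))

# A tweet that is not propaganda carries the single label "false".
single = build_t2_response(["false"])

# Hypothetical multi-label input; the real label strings may differ.
multi = build_t2_response(["2 discrediting the opponent", "1 appeal to commonality"])

print(single)
print(multi)
```

Sorting means the order of labels in the source JSON does not affect the target string.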



# Task 3 - Fine-grained propaganda characterisation

In [99]:
print(prompt_t3_dipromats)

You are an excellent assistant at categorizing propaganda in tweets. Propaganda is defined as:
information, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.

You will need to decide which of the following applies to the tweet. It could be one or more of the following.

A. Appeal to commonality - Ad populum: the tweet appeals to the will, the tradition or the history of a community to support an argument. e.g. "The leadership of the #CPC is the choice of history and of the Chinese people."
B. Appeal to commonality - Flag Waving: the tweet includes hyperbolic praise of a nation, worships a patriotic symbol, exhibits self-praise, or portrays someone as a hero. e.g. "The European Union is the best example, in the history of the world, of conflict resolution."
C. Discrediting the opponent - Name Calling/Labelling: the author refers to someone or something with pejorative labels. e.g. "The #US is the gravest threat to gl

In [100]:
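# NOTE: this path is hardcoded to the train split, unlike Tasks 1 and 2,
# which build the file name from SPLIT and LANG.
with open(data_dir / "train_t3_en.json") as f:
    t3_en = json.load(f)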

ds3 = Dataset.from_list(t3_en)

char2label = {
    "A": "appeal to commonality - ad populum",
    "B": "appeal to commonality - flag waving",
    "C": "discrediting the opponent - name calling",
    "D": "discrediting the opponent - undiplomatic assertiveness/whataboutism",
    "E": "discrediting the opponent - scapegoating",
    "F": "discrediting the opponent - propaganda slinging",
    "G": "discrediting the opponent - personal attacks",
    "H": "discrediting the opponent - fear appeals",
    "I": "discrediting the opponent - absurdity appeal",
    "J": "discrediting the opponent - demonization",
    "K": "discrediting the opponent - doubt",
    "L": "discrediting the opponent - reductio ad hitlerum",
    "M": "loaded language",
    "N": "appeal to authority - appeal to false authority",
    "O": "appeal to authority - bandwagoning",
    "P": "not propaganda",
}
label2char = {v: k for k, v in char2label.items()}

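def format_t3(example):
    labels = []

    for x in example["value"]:
        # Drop a leading single-digit prefix, if present.
        if x[0].isdigit():
            x = x[1:].strip()
        # "false" marks a tweet that is not propaganda.
        if x == "false":
            x = "not propaganda"
        # Drop any parenthesised suffix.
        if "(" in x:
            x = x.split("(")[0].strip()

        # Prefix the label with its letter, e.g. "M loaded language".
        labels.append(label2char[x] + " " + x)

    return {
        "text": prompt_t3_dipromats + example["text"],
        "response": "\n".join(sorted(labels)),
    }

ds3 = ds3.map(format_t3)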

Map:   0%|          | 0/6726 [00:00<?, ? examples/s]

In [101]:
while True:
    x = random.choice(ds3)
    if len(x["value"]) > 1:
        break

print(x)
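
The label clean-up inside `format_t3` can be exercised on its own. The sketch below uses a three-entry subset of `char2label` and made-up raw label strings. The exact raw format in the DIPROMATS JSON is an assumption here, and `normalise_label` is a hypothetical helper name.

```python
# Subset of the notebook's char2label, enough for the examples below.
char2label = {
    "A": "appeal to commonality - ad populum",
    "M": "loaded language",
    "P": "not propaganda",
}
label2char = {v: k for k, v in char2label.items()}

def normalise_label(raw):
    # Strip one leading digit, as format_t3 does.
    if raw[0].isdigit():
        raw = raw[1:].strip()
    # "false" is rewritten to the explicit "not propaganda" class.
    if raw == "false":
        raw = "not propaganda"
    # Anything from the first "(" onwards is discarded.
    if "(" in raw:
        raw = raw.split("(")[0].strip()
    return label2char[raw] + " " + raw

# Hypothetical raw labels; real values may be formatted differently.
a = normalise_label("1 appeal to commonality - ad populum")
m = normalise_label("3 loaded language (illustrative note)")
p = normalise_label("false")

print(a)
print(m)
print(p)
```

Only a single leading character is stripped, so a two-digit prefix would leave a stray digit and raise a `KeyError` at the `label2char` lookup. The same error appears for any label string that is missing from `char2label`.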



In [102]:
final_cols = ["text", "response"]

# Keep only the prompt and target columns, then write each task to Parquet.
ds1.remove_columns([x for x in ds1.column_names if x not in final_cols]).to_parquet(data_dir / f"{SPLIT}_t1_{LANG}_formatted.parquet")
ds2.remove_columns([x for x in ds2.column_names if x not in final_cols]).to_parquet(data_dir / f"{SPLIT}_t2_{LANG}_formatted.parquet")
ds3.remove_columns([x for x in ds3.column_names if x not in final_cols]).to_parquet(data_dir / f"{SPLIT}_t3_{LANG}_formatted.parquet")


42015260