[Task 1 - Propaganda Identification](#task-1---propaganda-identification)  
[Task 2 - Coarse propaganda characterisation](#task-2---coarse-propaganda-characterisation)  
[Task 3 - Fine-grained propaganda characterisation](#task-3---fine-grained-propaganda-characterisation)  

In [44]:
import os
from pathlib import Path

from dotenv import load_dotenv
import pandas as pd

from prompts import (
    prompt_t1_dipromats,
    prompt_t2_dipromats,
    prompt_t3_dipromats,
)

print("Loaded .env:", load_dotenv("../../.env", override=True))
data_dir = Path(os.environ["PROJECT_DIR"]) / "data" / "host" / "dipromats_2023"

SPLIT = "val"
LANG = "es"

Loaded .env: True


# Task 1 - Propaganda Identification

In [45]:
print(prompt_t1_dipromats)

You are an excellent assistant at identifying propaganda in tweets. Propaganda is defined as:
information, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.

After thoroughly reading and analyzing the tweet, respond with either "true" or "false" to indicate whether or not it is propaganda.

Tweet:




In [46]:
import json
from datasets import Dataset

# Load the Task 1 split; the context manager closes the file handle.
with open(data_dir / f"{SPLIT}_t1_{LANG}.json", encoding="utf-8") as f:
    t1 = json.load(f)

ds1 = Dataset.from_list(t1)

def format_t1(example):
    # Examples without a "value" key (unlabelled data) get an empty response.
    return {
        "text": prompt_t1_dipromats + example["text"],
        "response": example.get("value", ""),
    }

ds1 = ds1.map(format_t1)


Map:   0%|          | 0/1224 [00:00<?, ? examples/s]

In [47]:
ds1.shuffle()[0]

{'test_case': 'DIPROMATS2023',
 'id': '1178',
 'country': 'China',
 'user_name': 'li_baorong',
 'tweet_type': 'Retweet',
 'tweet_id': 1235738204386365440,
 'UTC': '2020-03-06 01:25:21+00:00',
 'rts&fav': 0,
 'language': 'es',
 'text': 'You are an excellent assistant at identifying propaganda in tweets. Propaganda is defined as:\ninformation, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.\n\nAfter thoroughly reading and analyzing the tweet, respond with either "true" or "false" to indicate whether or not it is propaganda.\n\nTweet:\n\nLa quinta Visita Oficial que realizó el Presidente Chávez a China, fue en el año 2008, fue orador en la séptima Comisión Mixta de Alto Nivel entre China y Venezuela @Ivan_Zerpa @PresidencialVen https://t.co/CJFq9zsBqG',
 'value': 'false',
 'response': 'false'}
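
The formatting step above can be tried in isolation, without the dataset or the `datasets` library. This is a minimal sketch: `PROMPT` is a shortened, hypothetical stand-in for `prompt_t1_dipromats`, `format_example` is an illustrative name, and the two records are invented.

```python
# Hypothetical stand-in for prompt_t1_dipromats; the real prompt is much longer.
PROMPT = "Tweet:\n\n"

def format_example(example, prompt=PROMPT):
    """Prepend the prompt and copy the gold label, if any, into "response"."""
    return {
        "text": prompt + example["text"],
        "response": example.get("value", ""),
    }

# Invented records: one labelled, one without a "value" key.
labelled = format_example({"text": "hola", "value": "false"})
unlabelled = format_example({"text": "hola"})

print(labelled)
print(unlabelled)
```

The unlabelled record ends up with an empty `response`, which is the same fallback `format_t1` applies when `"value"` is missing.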

# Task 2 - Coarse propaganda characterisation

In [48]:
print(prompt_t2_dipromats)

You are an excellent assistant at categorizing propaganda in tweets. Propaganda is defined as:
information, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.

You will need to decide which of the following applies to the tweet. It could be one or more of the following.

1. Appeal to commonality. This could be related to the following:
  - Ad populum: the tweet appeals to the will, the tradition or the history of a community to support an argument. e.g. "The leadership of the #CPC is the choice of history and of the Chinese people."
  - Flag Waving: the tweet includes hyperbolic praise of a nation, worships a patriotic symbol, exhibits self-praise, or portrays someone as a hero. e.g. "The European Union is the best example, in the history of the world, of conflict resolution."
2. Discrediting the opponent. This could be related to the following:
  - Name Calling/Labelling: the author refers to someone or something wi

In [49]:
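# Load the Task 2 split; the context manager closes the file handle.
with open(data_dir / f"{SPLIT}_t2_{LANG}.json", encoding="utf-8") as f:
    t2 = json.load(f)

ds2 = Dataset.from_list(t2)

def format_t2(example):
    # Map the raw negative label to the form used in the response.
    mapping = {"false": "5 not_propaganda"}

    if "value" not in example:
        response = ""
    else:
        # One label per line, sorted so the target string is deterministic.
        labels = [mapping.get(x, x) for x in example["value"]]
        response = "\n".join(sorted(labels))

    return {
        "text": prompt_t2_dipromats + example["text"],
        "response": response,
    }


ds2 = ds2.map(format_t2)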


Map:   0%|          | 0/1224 [00:00<?, ? examples/s]
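
The label handling in `format_t2` can be checked on its own. This is a minimal sketch: `format_labels_t2` is an illustrative name, and the two positive label strings are assumed for the example, since the only raw label this notebook displays is `'false'`.

```python
def format_labels_t2(values, mapping=None):
    """Rename mapped labels, sort them, and join one per line."""
    if mapping is None:
        mapping = {"false": "5 not_propaganda"}
    labels = [mapping.get(v, v) for v in values]
    return "\n".join(sorted(labels))

# The negative class is renamed.
print(format_labels_t2(["false"]))

# Assumed label strings, for illustration only: sorting makes the
# response independent of the order in which annotations were stored.
print(format_labels_t2(["2 discrediting the opponent", "1 appeal to commonality"]))
```

Sorting before joining means a multi-label example always produces the same target string, whatever order the labels arrive in.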

In [50]:
import random


# Sample until we hit an unlabelled example or one with more than one label.
while True:
    x = random.choice(ds2)

    if "value" not in x or len(x["value"]) > 1:
        break

print(x)



# Task 3 - Fine-grained propaganda characterisation

In [51]:
print(prompt_t3_dipromats)

You are an excellent assistant at categorizing propaganda in tweets. Propaganda is defined as:
information, especially of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.

You will need to decide which of the following applies to the tweet. It could be one or more of the following.

A. Appeal to commonality - Ad populum: the tweet appeals to the will, the tradition or the history of a community to support an argument. e.g. "The leadership of the #CPC is the choice of history and of the Chinese people."
B. Appeal to commonality - Flag Waving: the tweet includes hyperbolic praise of a nation, worships a patriotic symbol, exhibits self-praise, or portrays someone as a hero. e.g. "The European Union is the best example, in the history of the world, of conflict resolution."
C. Discrediting the opponent - Name Calling/Labelling: the author refers to someone or something with pejorative labels. e.g. "The #US is the gravest threat to gl

In [52]:
# Load the Task 3 split; the context manager closes the file handle.
with open(data_dir / f"{SPLIT}_t3_{LANG}.json", encoding="utf-8") as f:
    t3 = json.load(f)

ds3 = Dataset.from_list(t3)

# Option letters used in the Task 3 prompt, mapped to normalised label names.
char2label = {
    "A": "appeal to commonality - ad populum",
    "B": "appeal to commonality - flag waving",
    "C": "discrediting the opponent - name calling",
    "D": "discrediting the opponent - undiplomatic assertiveness/whataboutism",
    "E": "discrediting the opponent - scapegoating",
    "F": "discrediting the opponent - propaganda slinging",
    "G": "discrediting the opponent - personal attacks",
    "H": "discrediting the opponent - fear appeals",
    "I": "discrediting the opponent - absurdity appeal",
    "J": "discrediting the opponent - demonization",
    "K": "discrediting the opponent - doubt",
    "L": "discrediting the opponent - reductio ad hitlerum",
    "M": "loaded language",
    "N": "appeal to authority - appeal to false authority",
    "O": "appeal to authority - bandwagoning",
    "P": "not propaganda",
}
label2char = {v: k for k, v in char2label.items()}

def format_t3(example):

    if "value" not in example:
        response = ""
    else:
        labels = []

        for x in example["value"]:
            # Drop a single leading digit, if present.
            if x[0].isdigit():
                x = x[1:].strip()
            if x == "false":
                x = "not propaganda"

            # Drop any trailing parenthetical from the label.
            if "(" in x:
                x = x.split("(")[0].strip()

            # Raises KeyError if the normalised label is not in char2label.
            labels.append(label2char[x] + " " + x)

        response = "\n".join(sorted(labels))

    return {
        "text": prompt_t3_dipromats + example["text"],
        "response": response,
    }

ds3 = ds3.map(format_t3)

Map:   0%|          | 0/1224 [00:00<?, ? examples/s]
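
The per-label normalisation inside `format_t3` can be tested separately on a few strings. This is a minimal sketch: `normalise_t3_label` is an illustrative name, only three of the sixteen mapping entries are copied over, and the raw inputs other than `"false"` are invented, so the real annotation strings may look different.

```python
# Subset of the char2label table, enough for the examples below.
char2label = {
    "A": "appeal to commonality - ad populum",
    "K": "discrediting the opponent - doubt",
    "P": "not propaganda",
}
label2char = {v: k for k, v in char2label.items()}

def normalise_t3_label(raw):
    """Apply the same cleaning steps as format_t3 to one raw label."""
    label = raw
    if label[0].isdigit():          # drop a single leading digit
        label = label[1:].strip()
    if label == "false":            # negative class
        label = "not propaganda"
    if "(" in label:                # drop a trailing parenthetical
        label = label.split("(")[0].strip()
    return f"{label2char[label]} {label}"

# Invented inputs, for illustration only.
for raw in [
    "false",
    "1 appeal to commonality - ad populum",
    "2 discrediting the opponent - doubt (hypothetical note)",
]:
    print(repr(raw), "->", repr(normalise_t3_label(raw)))
```

A label that is missing from the table raises `KeyError`, which surfaces unexpected annotation strings immediately instead of passing them through silently.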

In [53]:
# Sample until we hit an unlabelled example or one with more than one label.
while True:
    x = random.choice(ds3)
    if "value" not in x or len(x["value"]) > 1:
        break

print(x)



In [54]:
# Keep only the model-facing columns before writing each task to Parquet.
final_cols = ["text", "response"]

ds1.remove_columns([x for x in ds1.column_names if x not in final_cols]).to_parquet(data_dir / f"{SPLIT}_t1_{LANG}_formatted.parquet")
ds2.remove_columns([x for x in ds2.column_names if x not in final_cols]).to_parquet(data_dir / f"{SPLIT}_t2_{LANG}_formatted.parquet")
ds3.remove_columns([x for x in ds3.column_names if x not in final_cols]).to_parquet(data_dir / f"{SPLIT}_t3_{LANG}_formatted.parquet")

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

7635038

In [55]:
t1[0]

{'test_case': 'DIPROMATS2023',
 'id': '634',
 'country': 'China',
 'user_name': 'embzhangrun',
 'tweet_type': 'Reply',
 'tweet_id': 1226495390872227841,
 'UTC': '2020-02-09 13:17:43+00:00',
 'rts&fav': 4,
 'language': 'es',
 'text': '@CristianRosaR @MeraCJ74 @PedroMCasals @StateDept En cuanto a la comunidad China laboriosa acá,  primero es centenaria,  o sea puede ser que llegó antes de sus ancestros y merece su respeto.  Segundo,  está muy unida con su patria y orgullosa del desarrollo de su patria.',
 'value': 'false'}

In [56]:
t2[0]

{'test_case': 'DIPROMATS2023',
 'id': '634',
 'country': 'China',
 'user_name': 'embzhangrun',
 'tweet_type': 'Reply',
 'tweet_id': 1226495390872227841,
 'UTC': '2020-02-09 13:17:43+00:00',
 'rts&fav': 4,
 'language': 'es',
 'text': '@CristianRosaR @MeraCJ74 @PedroMCasals @StateDept En cuanto a la comunidad China laboriosa acá,  primero es centenaria,  o sea puede ser que llegó antes de sus ancestros y merece su respeto.  Segundo,  está muy unida con su patria y orgullosa del desarrollo de su patria.',
 'value': ['false']}

In [57]:
t3[0]

{'test_case': 'DIPROMATS2023',
 'id': '634',
 'country': 'China',
 'user_name': 'embzhangrun',
 'tweet_type': 'Reply',
 'tweet_id': 1226495390872227841,
 'UTC': '2020-02-09 13:17:43+00:00',
 'rts&fav': 4,
 'language': 'es',
 'text': '@CristianRosaR @MeraCJ74 @PedroMCasals @StateDept En cuanto a la comunidad China laboriosa acá,  primero es centenaria,  o sea puede ser que llegó antes de sus ancestros y merece su respeto.  Segundo,  está muy unida con su patria y orgullosa del desarrollo de su patria.',
 'value': ['false']}

In [58]:
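# Sanity check: the three task files should hold the same tweets in the same order.
# Nothing is printed when every text matches. Note that zip() stops at the shortest list.
for x, y, z in zip(t1, t2, t3):
    if not (x["text"] == y["text"] == z["text"]):
        print(x["text"], y["text"], z["text"])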
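
A slightly stricter version of the alignment check above could look like the following. This is a sketch under assumptions: `find_misaligned` is a hypothetical helper and the records are invented. Unlike a bare `zip`, it refuses lists of different lengths, and it returns the offending indices instead of printing the texts.

```python
def find_misaligned(a, b, c, key="text"):
    """Return the indices at which the three record lists disagree on `key`."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError(f"length mismatch: {len(a)}, {len(b)}, {len(c)}")
    return [
        i
        for i, (x, y, z) in enumerate(zip(a, b, c))
        if not (x[key] == y[key] == z[key])
    ]

# Toy records, invented for illustration only.
a = [{"text": "uno"}, {"text": "dos"}]
b = [{"text": "uno"}, {"text": "dos"}]
c = [{"text": "uno"}, {"text": "tres"}]

print(find_misaligned(a, b, c))
```

Passing `key="id"` would run the same comparison on the `id` field that each record carries.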