# Week 7: Propaganda Detection

This week we will look at the propaganda detection task (Da San Martino et al., 2019) and develop a baseline model for technique classification.

We will be working with an adapted version of the dataset from the paper. In particular, I have:
* reduced the number of propaganda techniques
* randomly sampled sentences labelled with a particular technique
* randomly sampled sentences (from the same original articles) which do not contain propaganda
* reformatted the data so that the snippets can be "read" in the context of the sentence by inserting \<BOS\> and \<EOS\> tags.

Let's load it up in a pandas dataframe so that we can look at some examples to illustrate that last point.

In [3]:
import os
parentdir = "/Users/finpearson/Desktop/Github/ANLE---Python-/Week7/lab7resources/propaganda_dataset_v2"
train_file= "propaganda_train.tsv"
train_path=os.path.join(parentdir,train_file)


In [4]:
import pandas as pd
train_df=pd.read_csv(train_path,delimiter="\t",quotechar='|')
train_df.head(20)

Unnamed: 0,label,tagged_in_context
0,not_propaganda,"No, <BOS> he <EOS> will not be confirmed."
1,not_propaganda,This declassification effort <BOS> won’t make ...
2,flag_waving,"""The Obama administration misled the <BOS> Ame..."
3,not_propaganda,“It looks like we’re capturing the demise of t...
4,not_propaganda,"<BOS> Location: Westerville, Ohio <EOS>"
5,loaded_language,"Hitler <BOS> annihilated <EOS> 400,000 Germans..."
6,not_propaganda,A federal judge on Monday ordered U.S. immigra...
7,not_propaganda,<BOS> Kirstjen Nielsen (@SecNielsen) <EOS> Nov...
8,doubt,"As noted above, at this point literally every ..."
9,not_propaganda,Britain doesn't need more hate even just for a...


We can see in the cell above that, besides the row index, there are two columns. The first column (`label`) holds the propaganda technique or the label "not_propaganda". The second column (`tagged_in_context`) holds the text. Within the text we can see the special \<BOS\> and \<EOS\> tags, which mark the start and end of the snippet.

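Let's have a look at some examples from a single class. The cell below filters for **not_propaganda**; substituting another label such as `"loaded_language"` picks out that class instead.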

In [7]:
train_df[train_df["label"]=="not_propaganda"].head(20) # Boolean mask: keep only the rows whose label equals the string on the right of ==

Unnamed: 0,label,tagged_in_context
0,not_propaganda,"No, <BOS> he <EOS> will not be confirmed."
1,not_propaganda,This declassification effort <BOS> won’t make ...
3,not_propaganda,“It looks like we’re capturing the demise of t...
4,not_propaganda,"<BOS> Location: Westerville, Ohio <EOS>"
6,not_propaganda,A federal judge on Monday ordered U.S. immigra...
7,not_propaganda,<BOS> Kirstjen Nielsen (@SecNielsen) <EOS> Nov...
9,not_propaganda,Britain doesn't need more hate even just for a...
11,not_propaganda,"Ironically, even in doing this he is <BOS> lik..."
13,not_propaganda,During the term of the Assassination <BOS> Rec...
15,not_propaganda,President Trump ordered that the relevant docu...


In the first loaded_language example (row 5 of the original dataframe, shown in the first table above), a single word, *annihilated*, sits between \<BOS\> (**beginning of span**) and \<EOS\> (**end of span**).
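To make the tag format concrete, here is a minimal sketch of splitting a tagged sentence into its left context, the span, and its right context. The helper name `split_tagged` is hypothetical, and the sketch assumes each sentence contains exactly one \<BOS\>/\<EOS\> pair, as in the examples shown above.

```python
def split_tagged(text, bos="<BOS>", eos="<EOS>"):
    """Split a tagged sentence into (left context, span, right context).

    Assumes exactly one <BOS> ... <EOS> pair per sentence.
    """
    left, _, rest = text.partition(bos)   # everything before <BOS>
    span, _, right = rest.partition(eos)  # the span, then everything after <EOS>
    return left.strip(), span.strip(), right.strip()


example = "No, <BOS> he <EOS> will not be confirmed."
print(split_tagged(example))
# ('No,', 'he', 'will not be confirmed.')
```

Keeping the three parts separate makes it easy to work with either the snippet alone or the sentence with the tags removed.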

### Exercise 1: Exploratory Data Analysis
Write code and plot appropriate graphs to answer each of the following questions.

a) How many samples are there for each class?

b) What is the average length of sentence for each class?

c) What is the average length of propaganda snippet for each class? 



In [8]:
class_counts = train_df['label'].value_counts()
print(class_counts)

label
not_propaganda               1269
exaggeration,minimisation     170
name_calling,labeling         166
causal_oversimplification     165
loaded_language               161
repetition                    160
doubt                         157
appeal_to_fear_prejudice      157
flag_waving                   155
Name: count, dtype: int64
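The exercise asks for a plot as well as the raw counts. A minimal sketch with matplotlib follows; the counts are copied by hand from the output above so that the block runs on its own, and the figure size and axis labels are arbitrary choices.

```python
import matplotlib.pyplot as plt

# Class counts copied from the value_counts() output above.
class_counts = {
    "not_propaganda": 1269,
    "exaggeration,minimisation": 170,
    "name_calling,labeling": 166,
    "causal_oversimplification": 165,
    "loaded_language": 161,
    "repetition": 160,
    "doubt": 157,
    "appeal_to_fear_prejudice": 157,
    "flag_waving": 155,
}

# Horizontal bars keep the long class names readable.
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(list(class_counts.keys()), list(class_counts.values()))
ax.set_xlabel("Number of samples")
ax.set_title("Samples per class")
fig.tight_layout()
```

In the notebook itself, `class_counts.plot(kind="barh")` on the Series from the cell above should give the same picture. The eight technique classes sum to 1291 samples against 1269 for not_propaganda, so the binary task is close to balanced even though the nine-way task is not.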


In [11]:
import numpy as np

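# Calculate the length of each sentence in whitespace-separated tokens.
# Note: the <BOS> and <EOS> tags are counted as tokens here, so each length includes them.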
train_df['sentence_length'] = train_df['tagged_in_context'].apply(lambda x: len(x.split()))

# Calculate the average length of sentence for each class
avg_sentence_length = train_df.groupby('label')['sentence_length'].mean()
print(avg_sentence_length)

label
appeal_to_fear_prejudice     31.171975
causal_oversimplification    34.896970
doubt                        33.012739
exaggeration,minimisation    32.705882
flag_waving                  32.251613
loaded_language              29.465839
name_calling,labeling        35.078313
not_propaganda               22.874704
repetition                   26.262500
Name: sentence_length, dtype: float64
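The lengths above are counts of whitespace-separated tokens in `tagged_in_context`, so the \<BOS\> and \<EOS\> tags are included in each count. If the tags should be left out, one possible variant is sketched below; the helper name `sentence_length` is hypothetical, and it assumes the tags are separated from the surrounding words by spaces, as in the examples shown earlier.

```python
def sentence_length(text, tags=("<BOS>", "<EOS>")):
    """Count whitespace-separated tokens, ignoring the span tags."""
    return sum(1 for token in text.split() if token not in tags)


example = "No, <BOS> he <EOS> will not be confirmed."
print(len(example.split()))      # 8: the two tags are counted as tokens
print(sentence_length(example))  # 6: the tags are ignored
```

Because every sentence carries one pair of tags, the difference is a constant offset of 2 and does not change the ranking of the classes.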


In [13]:
# Function to extract propaganda snippet length
def extract_propaganda_snippet_length(text):
    snippet = text.split('<BOS>')[-1].split('<EOS>')[0].strip()
    return len(snippet.split())

# Apply the function to calculate propaganda snippet length
train_df['propaganda_snippet_length'] = train_df['tagged_in_context'].apply(extract_propaganda_snippet_length)

# Calculate the average length of propaganda snippet for each class
avg_propaganda_snippet_length = train_df.groupby('label')['propaganda_snippet_length'].mean()
print(avg_propaganda_snippet_length)

label
appeal_to_fear_prejudice     17.694268
causal_oversimplification    21.418182
doubt                        20.617834
exaggeration,minimisation     7.705882
flag_waving                  10.735484
loaded_language               3.496894
name_calling,labeling         4.295181
not_propaganda                6.441292
repetition                    2.862500
Name: propaganda_snippet_length, dtype: float64


### Exercise 2: Sentence Level Binary classification

Build a simple classifier (e.g., Naïve Bayes or Logistic Regression) which can take a sentence and predict whether it contains propaganda or not. Things you will need to think about:

* making a binary "propaganda" or "not_propaganda" label
* splitting the data into training and validation
* making a bag-of-words representation of each sentence.  This could be a dictionary where the keys are the words and the values are the frequencies within the sentence.  
* the implementation of the classifier itself.  You are not expected to build this yourself.  A good one to use would be the MultinomialNB classifier in scikit-learn
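As an illustration of the dictionary-style bag-of-words representation mentioned in the list above, here is a minimal sketch using `collections.Counter`. The helper name `bag_of_words` is hypothetical, and lower-casing plus whitespace splitting is just one simple tokenisation choice.

```python
from collections import Counter


def bag_of_words(sentence):
    """Map each lower-cased, whitespace-separated token to its frequency."""
    return dict(Counter(sentence.lower().split()))


print(bag_of_words("Vectors are useful and vectors are everywhere"))
# {'vectors': 2, 'are': 2, 'useful': 1, 'and': 1, 'everywhere': 1}
```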

It's worth thinking about the input format expected by the classifier before pre-processing the data.  Code to import and use the scikit-learn MultinomialNB classifier is below for consideration.

In [14]:
# This gives us some random toy data.
# In this toy data there are 10 data points (e.g., sentences).
# Each sentence is represented as a vector of 100 values.
# Each value could be the frequency of a particular word in the vocab.
# rng.randint(6, ...) draws integers from 0 to 5, so the max frequency here is 5.
# These are stored in X.
# There are 2 possible labels (0 and 1), which are stored in Y.

import numpy as np
rng=np.random.RandomState()
X = rng.randint(6,size=(10,100))
Y = rng.randint(2, size=10)

print(X)
print(Y)

[[3 5 4 1 0 5 3 1 3 0 5 4 5 2 1 5 0 4 1 4 0 3 3 1 5 5 2 3 3 3 2 3 5 5 1 3
  4 4 0 3 4 5 2 5 2 5 4 5 4 5 1 3 1 2 0 2 3 3 1 2 3 1 4 1 4 3 3 0 0 3 5 3
  3 4 5 4 0 2 3 3 1 0 4 3 4 2 1 0 1 1 1 5 3 5 0 1 2 1 4 1]
 [4 1 2 3 1 2 2 1 5 4 1 4 0 1 0 3 3 2 2 2 4 5 5 5 4 4 1 0 0 5 2 0 0 1 0 0
  0 1 3 3 4 5 0 3 4 1 0 3 0 5 1 2 2 3 4 1 3 0 1 0 1 1 3 2 1 4 0 0 2 3 0 4
  4 1 2 5 1 3 0 0 4 1 3 2 1 4 0 2 5 2 2 2 0 5 0 1 2 1 5 0]
 [2 3 4 1 3 5 5 3 5 3 3 0 0 0 0 2 2 2 2 5 2 1 0 3 0 5 3 5 4 1 3 2 5 5 3 5
  4 3 0 2 1 3 2 4 3 0 2 5 1 5 1 4 4 4 2 2 5 5 2 4 1 2 5 4 5 4 4 1 0 3 3 5
  4 5 0 1 4 0 5 5 3 3 1 3 4 1 5 1 1 2 3 5 5 3 1 0 2 5 4 1]
 [3 5 0 4 4 3 3 3 2 5 3 3 4 2 1 1 4 0 3 3 1 4 5 1 4 2 3 1 2 5 4 4 4 5 5 3
  3 0 3 4 4 4 2 0 3 5 3 0 4 3 5 3 0 1 0 1 5 4 1 1 2 0 4 5 2 5 1 5 3 5 1 3
  3 2 5 3 1 4 0 3 2 2 2 1 3 0 2 2 2 1 4 1 1 4 3 3 5 1 5 4]
 [4 5 0 1 0 3 0 0 4 1 5 2 5 3 4 2 0 2 4 0 3 5 0 2 4 5 4 5 1 1 3 5 2 1 2 3
  4 2 3 1 4 2 4 4 2 1 4 2 4 1 0 0 2 2 4 5 2 3 4 0 2 3 2 2 3 0 0 0 4 0 4 2
  2 4 3 0 4 1 5 2 0 1 2 

In [16]:
#we can give this as input to the MultinomialNB classifier using the fit method
from sklearn.naive_bayes import MultinomialNB

classifier=MultinomialNB()
classifier.fit(X,Y)


In [22]:
#we can predict the value for any datapoint
#here we are making up some more random points with random labels so I wouldn't expect particularly high accuracy!

X = rng.randint(6,size=(5,100))
Y = rng.randint(2, size=5)
print(Y)
classifier.predict(X)


[1 0 0 1 1]


array([0, 0, 0, 1, 0])

So imagine you have sentence representations and labels as follows.  We need to generate a vector for each sentence, where each column corresponds to a particular word in the vocabulary.

In [23]:
toy_training_data=[({"everyone":1,"hates":1,"vectors":1},1),({"vectors":1,"are":1,"useful":1},0)]
Xdicts,Y=zip(*toy_training_data)


We could write some code to turn the Xdicts into vectors (you first need to work out what the vocab is and assign an index to each vocab item).  Or we can use scikit-learn's CountVectorizer directly on the texts.

In [24]:
toy_training_data=[("everyone hates vectors",1),("vectors are useful",0)]
Xsents,Y=zip(*toy_training_data)

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
vectorizer.fit(Xsents)

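# Print the learned vocabulary: each unique word mapped to its column index
print("Vocabulary: ", vectorizer.vocabulary_)

# Encode the sentences as a sparse matrix of word counts
Xvectors = vectorizer.transform(Xsents)

# Each printed entry is (row, column) followed by the count
print(Xvectors)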

Vocabulary:  {'everyone': 1, 'hates': 2, 'vectors': 4, 'are': 0, 'useful': 3}
  (0, 1)	1
  (0, 2)	1
  (0, 4)	1
  (1, 0)	1
  (1, 3)	1
  (1, 4)	1


In [26]:
classifier=MultinomialNB()
classifier.fit(Xvectors,Y)

In [27]:
classifier.predict(Xvectors[1])

array([0])
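For comparison with the CountVectorizer route, the dictionary representation from earlier can also be turned into vectors by hand: work out the vocabulary, give each word a column index, then fill in the counts. The sketch below uses plain Python lists, and the helper name `to_vector` is hypothetical. Sorting the vocabulary alphabetically reproduces the column indices that CountVectorizer printed above.

```python
toy_dicts = [
    {"everyone": 1, "hates": 1, "vectors": 1},
    {"vectors": 1, "are": 1, "useful": 1},
]

# Collect the vocabulary and assign each word a column index (alphabetical order).
vocab = sorted({word for bow in toy_dicts for word in bow})
index = {word: i for i, word in enumerate(vocab)}


def to_vector(bow, index):
    """Turn a {word: count} dictionary into a dense list of counts."""
    row = [0] * len(index)
    for word, count in bow.items():
        if word in index:  # words outside the vocabulary are skipped
            row[index[word]] = count
    return row


X = [to_vector(bow, index) for bow in toy_dicts]
print(index)  # {'are': 0, 'everyone': 1, 'hates': 2, 'useful': 3, 'vectors': 4}
print(X)      # [[0, 1, 1, 0, 1], [1, 0, 0, 1, 1]]
```

scikit-learn also provides `DictVectorizer` in `sklearn.feature_extraction` for this dictionary-to-matrix step.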

What happens if some new sentences contain words that are not in the vocabulary?  We can see here that the vectorizer simply ignores them (*really* in the second sentence gets no column).  This is fine: the classifier has no statistics for unseen words, so they could not inform the prediction anyway.

In [28]:
toy_test_data=[("everyone hates useful vectors",1),("vectors are really useful",0)]
Xsents,Y=zip(*toy_test_data)
testVectors=vectorizer.transform(Xsents)

print(testVectors)


  (0, 1)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (1, 0)	1
  (1, 3)	1
  (1, 4)	1


In [29]:
classifier.predict(testVectors)

array([1, 0])
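To check which words are being dropped, one option is to run the vectorizer's own tokeniser over a sentence and compare the tokens with the fitted vocabulary. This is a minimal sketch that refits the toy vectorizer so it runs on its own; the helper name `unknown_tokens` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(["everyone hates vectors", "vectors are useful"])

# build_analyzer() returns the preprocessing and tokenisation function
# that the vectorizer applies to each document.
analyze = vectorizer.build_analyzer()


def unknown_tokens(sentence):
    """Return the tokens of a sentence that are missing from the fitted vocabulary."""
    return [tok for tok in analyze(sentence) if tok not in vectorizer.vocabulary_]


print(unknown_tokens("vectors are really useful"))  # ['really']
```

On the real data, a high proportion of unknown tokens in the validation set would suggest that the training vocabulary is too small to generalise well.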

Let's go back to a modified version of the exercise which assumes we are going to use CountVectorizer and MultinomialNB.

Build a MultinomialNB classifier which can take a sentence and predict whether it contains propaganda or not.  Things you will need to think about:

* making a binary "propaganda" or "not_propaganda" label
* splitting the data into training and validation
* making a bag-of-words representation of each sentence using CountVectorizer
* training a MultinomialNB classifier on the training data and evaluating it on the validation data

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Binary Label Creation
train_df['binary_label'] = train_df['label'].apply(lambda x: 1 if x != 'not_propaganda' else 0)

# Step 2: Data Splitting
X_train, X_val, y_train, y_val = train_test_split(train_df['tagged_in_context'], train_df['binary_label'], test_size=0.2, random_state=42)

# Step 3: Bag-of-Words Representation
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_val_bow = vectorizer.transform(X_val)
#print(X_val_bow[20])

# Step 4: Training MultinomialNB
classifier1 = MultinomialNB()
classifier1.fit(X_train_bow, y_train)

# Step 5: Evaluation
predictions = classifier1.predict(X_val_bow)
#print(predictions)
#print(y_val)
accuracy = accuracy_score(y_val, predictions)
print("Accuracy:", accuracy)
print(classification_report(y_val, predictions))


Accuracy: 0.677734375
              precision    recall  f1-score   support

           0       0.76      0.56      0.64       265
           1       0.63      0.81      0.71       247

    accuracy                           0.68       512
   macro avg       0.69      0.68      0.67       512
weighted avg       0.69      0.68      0.67       512
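To put the accuracy in context, it helps to compare it with a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent validation label. The arithmetic below uses the support counts from the report above.

```python
# Support counts taken from the classification report above.
support = {0: 265, 1: 247}

total = sum(support.values())
majority_baseline = max(support.values()) / total

print(total)                        # 512
print(round(majority_baseline, 3))  # 0.518
```

So an accuracy of about 0.68 is a clear improvement over always guessing "not propaganda", though there is plenty of room left.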



### Exercise 3: Snippet level binary classification
Repeat exercise 2 but train and test on propaganda snippets rather than whole sentences.


In [48]:
# Step 1: Data Preprocessing
train_df['propaganda_snippet'] = train_df['tagged_in_context'].apply(lambda x: x.split('<BOS>')[-1].split('<EOS>')[0].strip())
print(train_df['propaganda_snippet'])

# Step 2: Feature Extraction
X_train_snippet, X_val_snippet, y_train, y_val = train_test_split(train_df['propaganda_snippet'], train_df['binary_label'], test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
X_train_bow_snippet = vectorizer.fit_transform(X_train_snippet)
X_val_bow_snippet = vectorizer.transform(X_val_snippet)

# Step 3: Model Training and Evaluation
classifier2 = MultinomialNB()
classifier2.fit(X_train_bow_snippet, y_train)

predictions = classifier2.predict(X_val_bow_snippet)
accuracy = accuracy_score(y_val, predictions)
print("Accuracy:", accuracy)
print(classification_report(y_val, predictions))


0                                                      he
1       won’t make things any worse than they are for ...
2                                         American people
3                                                     and
4                             Location: Westerville, Ohio
                              ...                        
2555                            We support and appreciate
2556    capacity to check whether Iran was conducting ...
2557                               one for those recently
2558    the law of gradualness not the gradualness of ...
2559                                              selfish
Name: propaganda_snippet, Length: 2560, dtype: object
Accuracy: 0.611328125
              precision    recall  f1-score   support

           0       0.74      0.39      0.51       265
           1       0.56      0.85      0.68       247

    accuracy                           0.61       512
   macro avg       0.65      0.62      0.59       512
weighted avg       0.65      0.61      0.59       512

### Exercise 4: Mixing snippets and sentences

* How well does your sentence-level classifier work on snippets?
* How well does your snippet-level classifier work on sentences?
* What about if you train on both sentences and snippets?

In [51]:
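# Note: each classifier is evaluated here on its own validation representation
# (sentence model on sentences, snippet model on snippets), so these numbers
# reproduce the results from Exercises 2 and 3.
predictions1 = classifier1.predict(X_val_bow)
predictions2 = classifier2.predict(X_val_bow_snippet)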

accuracy1 = accuracy_score(y_val, predictions1)
print("Accuracy:", accuracy1)
print(classification_report(y_val, predictions1))

accuracy2 = accuracy_score(y_val, predictions2)
print("Accuracy:", accuracy2)
print(classification_report(y_val, predictions2))

Accuracy: 0.677734375
              precision    recall  f1-score   support

           0       0.76      0.56      0.64       265
           1       0.63      0.81      0.71       247

    accuracy                           0.68       512
   macro avg       0.69      0.68      0.67       512
weighted avg       0.69      0.68      0.67       512

Accuracy: 0.611328125
              precision    recall  f1-score   support

           0       0.74      0.39      0.51       265
           1       0.56      0.85      0.68       247

    accuracy                           0.61       512
   macro avg       0.65      0.62      0.59       512
weighted avg       0.65      0.61      0.59       512
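To answer the cross-evaluation questions, each classifier needs to see the *other* kind of text, passed through the vectorizer that classifier was trained with. In the cells above the name `vectorizer` is reassigned in Exercise 3, so the sentence-level vectorizer would need to be kept under its own name. The sketch below shows the plumbing on a hypothetical four-example toy set (all texts and variable names are made up for illustration). It evaluates on its own training data, so the scores themselves mean nothing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: (sentence, snippet, binary label).
toy = [
    ("they will destroy our great nation", "destroy our great nation", 1),
    ("the meeting starts at noon", "starts at noon", 0),
    ("these traitors hate our nation", "traitors", 1),
    ("the report was published on monday", "the report", 0),
]
sentences, snippets, labels = zip(*toy)

# One vectorizer per training representation, each kept under its own name.
sent_vec = CountVectorizer().fit(sentences)
snip_vec = CountVectorizer().fit(snippets)
sent_clf = MultinomialNB().fit(sent_vec.transform(sentences), labels)
snip_clf = MultinomialNB().fit(snip_vec.transform(snippets), labels)

# Cross-evaluation: inputs go through the vectorizer of the classifier being tested.
sent_on_snip = sent_clf.predict(sent_vec.transform(snippets))
snip_on_sent = snip_clf.predict(snip_vec.transform(sentences))
print("sentence model on snippets:", accuracy_score(labels, sent_on_snip))
print("snippet model on sentences:", accuracy_score(labels, snip_on_sent))

# Training on both: concatenate the texts and repeat the labels.
both_texts = sentences + snippets
both_labels = labels + labels
both_vec = CountVectorizer().fit(both_texts)
both_clf = MultinomialNB().fit(both_vec.transform(both_texts), both_labels)
```

With the real data, the same pattern would apply to the training and validation splits from Exercises 2 and 3, which share `random_state=42` and therefore contain the same rows.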



### Extensions

1) Can you use cross-validation to evaluate your classifiers rather than a single training/development split?

2) Can you use a different type of classifier and/or feature representation, e.g., logistic regression where the feature values are tf-idf values rather than frequencies?

3) Can you carry out multi-class classification to identify the propaganda technique used?  

4) Does it help to first use your binary classifier to decide whether there is any propaganda or not?
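
As a possible starting point for extensions 1 and 2, the sketch below combines tf-idf features with logistic regression in a scikit-learn pipeline and scores it with cross-validation. The toy texts and labels are hypothetical, and `cv=2` is chosen only because the toy set is tiny; a larger number of folds would be more usual on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the sentences and binary labels.
texts = [
    "they will destroy our great nation",
    "the meeting starts at noon",
    "these traitors hate our nation",
    "the report was published on monday",
    "only a fool would trust them",
    "the train leaves from platform two",
    "our glorious people will never surrender",
    "the committee met for two hours",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Putting the vectorizer inside the pipeline means the vocabulary and idf
# weights are refit on the training portion of every fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, texts, labels, cv=2)
print("Mean accuracy:", scores.mean())
```

For extension 3, the same pipeline can be given the original multi-class `label` column as the target instead of the binary label, since both LogisticRegression and MultinomialNB handle more than two classes.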
