# Multinomial Multi-Label Logistic Regression for PCL Detection

As a first experiment, we detect Patronizing and Condescending Language (PCL) with a multi-label logistic regression. Logistic regression is a simple classification approach that was not explored in the paper, which is why we start with it.

# Loading Data

First, we load the dontpatronizeme dataset from our previously saved train/test split, so that we have stratified samples for training and testing.

In [1]:
from dontpatronizeme.ext_dont_patronize_me import DontPatronizeMe

# path to training set, path to test set
dpm = DontPatronizeMe('./data/dpm_train.csv', './data/dpm_test.csv')

In [2]:
dpm.load_task2()
dpm.train_task2_df.head()

Unnamed: 0,par_id,art_id,text,keyword,country,label,higher level label
1,1279,@@7896098,Pope Francis washed and kissed the feet of Mus...,refugee,ng,"[0, 1, 0, 0, 0, 0, 0]","[1, 0, 0]"
3,4063,@@3002894,"""Budding chefs , like """" Fred """" , """" Winston ...",in-need,ie,"[1, 0, 0, 1, 1, 1, 0]","[1, 1, 1]"
6,4177,@@930041,The Word of God is truth that 's living and ab...,hopeless,us,"[1, 0, 0, 0, 0, 1, 0]","[1, 0, 1]"
7,3963,@@18867357,"Chantelle Owens , Mrs Planet 2016 , hosted the...",in-need,za,"[1, 1, 0, 0, 0, 1, 0]","[1, 0, 1]"
8,2001,@@14012804,t is remiss not to mention here that not all s...,poor-families,tz,"[0, 0, 1, 0, 0, 0, 0]","[0, 1, 0]"


We can see that the data has been loaded correctly. Most important for training are the texts, which we will convert to vectors next, as well as the lower level labels and the higher level labels of the PCL taxonomy.

In [3]:
dpm.load_test()
dpm.test_set_df.head()

Unnamed: 0,par_id,art_id,text,keyword,country,label,higher level label
0,4046,@@14767805,We also know that they can benefit by receivin...,hopeless,us,"[1, 0, 0, 1, 0, 0, 0]","[1, 1, 0]"
2,8330,@@17252299,Many refugees do n't want to be resettled anyw...,refugee,ng,"[0, 0, 1, 0, 0, 0, 0]","[0, 1, 0]"
4,4089,@@25597822,"""In a 90-degree view of his constituency , one...",homeless,pk,"[1, 0, 0, 0, 0, 0, 0]","[1, 0, 0]"
5,432,@@15802146,He depicts demonstrations by refugees at the b...,refugee,nz,"[0, 0, 0, 0, 0, 1, 0]","[0, 0, 1]"
9,369,@@15636898,""""""" People do n't understand the hurt , people...",women,ie,"[1, 0, 1, 1, 0, 1, 0]","[1, 1, 1]"


# Converting Paragraphs into Embeddings

To train our logistic regression classifier, we need to convert the input paragraphs into embeddings. More specifically, we need one fixed-length vector per paragraph that we can feed into our classifier.

In [4]:
paragraphs = dpm.train_task2_df.loc[:, 'text']
paragraphs

1      Pope Francis washed and kissed the feet of Mus...
3      "Budding chefs , like "" Fred "" , "" Winston ...
6      The Word of God is truth that 's living and ab...
7      Chantelle Owens , Mrs Planet 2016 , hosted the...
8      t is remiss not to mention here that not all s...
                             ...                        
987    Citing the fact that these kids who died at Go...
988    Fern ? ndez was a well-known philanthropist wh...
989    Touched much by their plight , Commanding Offi...
990    She reiterated her ministry 's commitment to p...
991    Preaching the sermon , the Dean of the St. Pet...
Name: text, Length: 815, dtype: object

Here we use scikit-learn's TfidfVectorizer, which does this job nicely. The vectorizer learns its vocabulary from the training data, and we reuse the fitted vectorizer later so that the test vectors have the same shape.

Another approach would have been word2vec-like sentence embeddings or simply a bag-of-words approach, but tf-idf already weighs each word by how often it occurs in a paragraph relative to how common it is across all paragraphs, which might help in distinguishing the content of the texts.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(paragraphs)
print(X_train.shape)
vocabulary = vectorizer.get_feature_names_out()

(815, 6859)
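To illustrate why the fitted vectorizer has to be reused, here is a minimal sketch on toy paragraphs. The sentences and the names `toy_train`, `toy_test` and `toy_vectorizer` are made up for this example and are not part of the dataset or the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy paragraphs (hypothetical, not taken from the dataset).
toy_train = ["they need our help", "we must help the poor", "the poor need hope"]
toy_test = ["refugees need help and shelter"]

toy_vectorizer = TfidfVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary
X_toy_test = toy_vectorizer.transform(toy_test)        # reuses it unchanged

# Both matrices have one column per word of the training vocabulary.
print(X_toy_train.shape, X_toy_test.shape)
# Words that only occur in the test paragraph are silently dropped.
print("refugees" in toy_vectorizer.vocabulary_)
```

Calling `fit_transform` on the test paragraphs instead would build a second, different vocabulary, and the resulting vectors would no longer line up with the columns the classifier was trained on.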


# Training

For training, we need to extract the labels from the dataframe and fit a classifier on them. For comparison, we start with a simple multiclass logistic regression that predicts the entire set of labels as a single class (the label list as a string), to see whether the multilabel approach has advantages here.

In [6]:
import ast, numpy as np
Y_train = dpm.train_task2_df.loc[:, 'label'].to_numpy()
Y_train = np.array([np.array(ast.literal_eval(x)) for x in Y_train])
Y_train[-5:]

array([[1, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0]])

In [7]:
X_train.shape

(815, 6859)

In [8]:
# normal logistic regression
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, dpm.train_task2_df.loc[:, 'label'])
classifier.score(X_train,dpm.train_task2_df.loc[:, 'label'])

0.4404907975460123

We can see that the normal (multiclass) logistic regression classifier reaches an accuracy of only about 0.44 on the training data. Ideally the score would be higher, and as we will see in the next section, the normal logistic regression performs even worse on the test data.
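One limitation of treating each label string as a class can be sketched on toy data. The arrays and the name `powerset_clf` below are hypothetical and only serve as illustration: the classifier can only ever output label combinations it has seen during training.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (hypothetical): four samples with two features each.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.1, 0.9]])
# Label vectors stored as strings, as in the 'label' column of the dataframe.
y_toy = np.array(["[0, 1]", "[1, 0]", "[1, 0]", "[0, 1]"])

powerset_clf = LogisticRegression(random_state=0).fit(X_toy, y_toy)

# Every distinct string seen in training becomes one class.
print(powerset_clf.classes_)
# A combination that never occurred in training cannot be predicted.
print("[1, 1]" in powerset_clf.classes_)
```

With 7 binary labels there are up to 128 possible combinations, so many of them will have very few or no training examples in a dataset of this size.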

In [9]:
# multilabel logistic regression
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

clf = MultiOutputClassifier(estimator= LogisticRegression(max_iter = 500)).fit(X_train, Y_train)
clf.score(X_train, Y_train)

0.3006134969325153

Thanks to the MultiOutputClassifier wrapper around the logistic regression, our multilabel classifier fits one binary model per PCL category and predicts all 7 categories independently of each other. Note that its score is the exact-match accuracy, so a paragraph only counts as correct if all 7 labels are predicted correctly. With about 0.30, it scores lower on the training data than the normal logistic regression does, but we will explore the real results in the next section.
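Because the score of a multilabel classifier in scikit-learn is an exact-match accuracy, it can look much worse than the accuracy per individual label. The following sketch contrasts the two on a hand-made toy example; the arrays `Y_true_toy` and `Y_hat_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Toy multilabel targets and predictions (hypothetical), 4 samples x 3 labels.
Y_true_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
Y_hat_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0]])

# Exact match: a row only counts if every label in it is correct.
subset_accuracy = accuracy_score(Y_true_toy, Y_hat_toy)
# Per-label accuracy: share of individual label decisions that are correct.
per_label_accuracy = 1 - hamming_loss(Y_true_toy, Y_hat_toy)

print(subset_accuracy, per_label_accuracy)
```

Here only two of the four rows match exactly, although ten of the twelve individual label decisions are correct.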

# Testing

Let us try out both of our classifiers on the test set now. For that we need to convert the test paragraphs into the same embedding shape as our train paragraphs, which the TfidfVectorizer also does for us:

In [10]:
test_paragraphs = dpm.test_set_df.loc[:, 'text']
X_test = vectorizer.transform(test_paragraphs)
print(X_test.shape)

(177, 6859)


In [11]:
Y_test = dpm.test_set_df.loc[:, 'label'].to_numpy()
Y_test = np.array([np.array(ast.literal_eval(x)) for x in Y_test])
Y_test[-5:]

array([[0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 1, 1, 0],
       [0, 1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0]])

In [12]:
# normal logistic regression
classifier.score(X_test,dpm.test_set_df.loc[:, 'label'])

0.14124293785310735

In [13]:
# multilabel logistic regression
clf.score(X_test, Y_test)

0.1751412429378531

We can observe that both scores drop considerably compared to the scores on the training data, but on the test data the multilabel classifier (about 0.18) scores higher than the multiclass classifier (about 0.14). To see what our multilabel approach really does, we will now dive into the evaluation of the predictions for every category.

# Evaluating

In order to properly evaluate the multilabel classification results we will use our extended evaluation script that will show us accuracy, precision, recall and F1 measure for each PCL category separately.
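The evaluation script itself is not shown in this notebook. As a rough stand-in, comparable per-category figures can be obtained directly with scikit-learn. This is a sketch on hypothetical toy arrays, not the implementation of our script.

```python
import numpy as np
from sklearn.metrics import (
    multilabel_confusion_matrix,
    precision_recall_fscore_support,
)

# Toy multilabel data (hypothetical): the second label is never predicted.
Y_true_toy = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
Y_pred_toy = np.array([[1, 0], [1, 0], [1, 0], [1, 0]])

# One 2x2 matrix per label, laid out as [[tn, fp], [fn, tp]].
matrices = multilabel_confusion_matrix(Y_true_toy, Y_pred_toy)

# average=None returns one value per label; zero_division=0 defines the
# precision of a never-predicted label as 0 instead of emitting a warning.
precision, recall, f1, _ = precision_recall_fscore_support(
    Y_true_toy, Y_pred_toy, average=None, zero_division=0
)

for i, matrix in enumerate(matrices):
    print(f"label {i}: precision={precision[i]:.2f} "
          f"recall={recall[i]:.2f} f1={f1[i]:.2f}")
    print(matrix)
```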

In [14]:
import dontpatronizeme.ext_evaluation
Y_pred = clf.predict(X_test)
Y_pred.sum(axis=0)

array([176,   0,   0,   1,   0,  61,   0])

In [15]:
# saving predictions for analysis
import pickle
with open('data/pred_logreg.obj', 'wb') as pklobj:
    pickle.dump(Y_pred, pklobj)

It is striking that four categories were never predicted to appear anywhere in the test data. A fifth category was predicted to occur only once. Let's see what our evaluation script has to say about that:

In [16]:
dontpatronizeme.ext_evaluation.evaluate(Y_test, Y_pred, 'll')

Unbalanced Power Relations
Accuracy: 0.8022598870056498
Precision: 0.8068181818181818
Recall: 0.993006993006993
F1 Score: 0.890282131661442
Confusion Matrix: (tn, fp / fn, tp)
[[  0  34]
 [  1 142]]
--------------------------------------------------
Shallow Solution
Accuracy: 0.7796610169491526
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix: (tn, fp / fn, tp)
[[138   0]
 [ 39   0]]
--------------------------------------------------
Presupposition
Accuracy: 0.7457627118644068
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix: (tn, fp / fn, tp)
[[132   0]
 [ 45   0]]
--------------------------------------------------
Authority Voice
Accuracy: 0.7457627118644068
Precision: 1.0
Recall: 0.021739130434782608
F1 Score: 0.042553191489361694
Confusion Matrix: (tn, fp / fn, tp)
[[131   0]
 [ 45   1]]
--------------------------------------------------
Metaphor
Accuracy: 0.7796610169491526
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix: (tn, fp / fn, tp)
[[138   0]
 [



On the one hand, we can see that none of the categories Shallow Solution, Presupposition, Metaphor and The poorer the merrier has a single positive prediction. This results in 0s for precision, recall and F1 score for these categories. On the other hand, they all have a fairly high accuracy (roughly 0.75 or higher), which suggests that the model has learned that there are few positive examples of these categories to begin with and that it fares better if it labels all paragraphs as not containing them. This can also be seen in the confusion matrix of each category, which shows that most samples are true negatives and therefore naturally increase the accuracy.

The exact opposite happens for the category Unbalanced Power Relations. Here, the model predicts that almost every paragraph (176 of 177) contains PCL of this category and scores high in all four measures with that generalisation, although it does not identify a single true negative.

For the category Authority Voice, the model seems to be very sure about one sample being positive (maybe linked to a single word that only exists in this paragraph and has been labeled in the training set with the same category). This way, we get a precision of 1, but the model is definitely lacking in recall and thus in its F1 measure.

Finally, we have Compassion. This is the only category where the predictions are spread out across positive and negative samples, and the model seems to have learned to separate the occurrences better than chance.

So, even an approach as simple as logistic regression leads to acceptable accuracy, but it overgeneralises so strongly on the given dataset that it almost always predicts one frequently occurring category and never predicts four rarely occurring categories. With this in mind, all further results have to be handled with a lot of care.
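The effect of the class imbalance on accuracy can be checked by hand from the confusion matrices printed above. The helper `metrics_from_counts` is a hypothetical name introduced for this sketch; the counts are copied from the output for Shallow Solution and Unbalanced Power Relations.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Return (accuracy, precision, recall, f1) for one binary category."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Undefined ratios are set to 0, matching the zeros in the output above.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Shallow Solution: never predicted, yet 138 of 177 decisions are correct.
shallow_solution = metrics_from_counts(tn=138, fp=0, fn=39, tp=0)
# Unbalanced Power Relations: predicted for 176 of 177 paragraphs.
power_relations = metrics_from_counts(tn=0, fp=34, fn=1, tp=142)

print(shallow_solution)
print(power_relations)
```

In both cases the accuracy mostly reflects how often the category occurs in the test set rather than how well the model separates positive from negative paragraphs.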

# Training on higher level labels

Now on to something different. We have previously prepared the higher level labels of the taxonomy so that they match the annotated lower level labels of each paragraph. We want to see whether predicting these labels is easier, as there are fewer labels to choose from and more positive samples per label to train on. We will use the already embedded paragraphs from the previous training and only swap out the labels to train two new classifiers: one being again the "normal" logistic regression and the other being the multilabel logistic regression.

In [17]:
X_train_hl = X_train
Y_train_hl = dpm.train_task2_df.loc[:, 'higher level label'].to_numpy()
Y_train_hl = np.array([np.array(ast.literal_eval(x)) for x in Y_train_hl])
Y_train_hl[-5:]

array([[1, 0, 1],
       [1, 0, 0],
       [1, 0, 1],
       [1, 0, 0],
       [0, 1, 0]])
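How the higher level labels relate to the lower level ones can be sketched as a column-wise grouping. The column groups below are an assumption inferred from the category order in the evaluation output and from the rows shown in this notebook; the helper `to_higher_level` is hypothetical and is not the preprocessing code that produced the saved dataframes.

```python
import numpy as np

# Assumed column groups (inferred, not taken from the preprocessing code):
# saviour = Unbalanced Power Relations, Shallow Solution
# expert  = Presupposition, Authority Voice
# poet    = Metaphor, Compassion, The poorer the merrier
GROUPS = [(0, 1), (2, 3), (4, 5, 6)]

def to_higher_level(Y_low):
    """Set a higher level label to 1 if any of its lower level labels is 1."""
    Y_low = np.asarray(Y_low)
    return np.stack([Y_low[:, list(g)].max(axis=1) for g in GROUPS], axis=1)

# The last five training rows printed earlier in this notebook.
Y_low_tail = np.array([[1, 0, 0, 0, 1, 0, 0],
                       [1, 0, 0, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0, 1, 0],
                       [1, 0, 0, 0, 0, 0, 0],
                       [0, 0, 0, 1, 0, 0, 0]])
print(to_higher_level(Y_low_tail))
```

Under this assumption, the result reproduces the five higher level rows shown above.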

In [18]:
# normal logistic regression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, dpm.train_task2_df.loc[:, 'higher level label'])
classifier.score(X_train,dpm.train_task2_df.loc[:, 'higher level label'])

0.750920245398773

In [19]:
# multilabel logistic regression
clf = MultiOutputClassifier(estimator= LogisticRegression(max_iter = 500)).fit(X_train_hl, Y_train_hl)
clf.score(X_train_hl, Y_train_hl)

0.6404907975460122

We can see that both score a lot higher on the training data. Let's see the scores for our test data:

In [20]:
X_test_hl = X_test
classifier.score(X_test_hl, dpm.test_set_df.loc[:, 'higher level label'])

0.2598870056497175

In [21]:
Y_test_hl = dpm.test_set_df.loc[:, 'higher level label'].to_numpy()
Y_test_hl = np.array([np.array(ast.literal_eval(x)) for x in Y_test_hl])
Y_test_hl[-5:]

array([[0, 0, 1],
       [0, 0, 1],
       [1, 0, 1],
       [1, 0, 1],
       [1, 0, 0]])

In [22]:
clf.score(X_test_hl, Y_test_hl)

0.3954802259887006

We can observe a similar pattern to the lower level predictions: the multilabel approach wins on the test data, although both models suffer drastic losses in terms of performance on the test set. Let's evaluate the categories:

In [23]:
Y_pred_hl = clf.predict(X_test_hl)
Y_pred_hl.sum(axis=0)

array([176,  24, 112])

Predictions are spread out more evenly, although the 176 samples predicted for the first category seem to be dominated by the lower level category Unbalanced Power Relations, which was predicted for the same number of paragraphs.

In [24]:
dontpatronizeme.ext_evaluation.evaluate(Y_test_hl, Y_pred_hl, 'hl')

The saviour
Accuracy: 0.8418079096045198
Precision: 0.8465909090909091
Recall: 0.9933333333333333
F1 Score: 0.9141104294478527
Confusion Matrix: (tn, fp / fn, tp)
[[  0  27]
 [  1 149]]
--------------------------------------------------
The expert
Accuracy: 0.6666666666666666
Precision: 0.7916666666666666
Recall: 0.2602739726027397
F1 Score: 0.39175257731958757
Confusion Matrix: (tn, fp / fn, tp)
[[99  5]
 [54 19]]
--------------------------------------------------
The poet
Accuracy: 0.7062146892655368
Precision: 0.7410714285714286
Recall: 0.7830188679245284
F1 Score: 0.761467889908257
Confusion Matrix: (tn, fp / fn, tp)
[[42 29]
 [23 83]]
--------------------------------------------------
F1 Score Average: 0.6891102988918991


Compared to the lower level labels, the performance seems to improve across all three categories. In the first category, "The saviour", the model still labels nearly all paragraphs as containing this category, but the scores improve because both of its lower level categories now count towards it, so there are even fewer samples that do not contain this category at all.

However, having Presupposition and Authority Voice together in "The expert" seems to improve the scores a little. The lower level predictions included only one positive prediction for Authority Voice and none for Presupposition. With the higher level labels, the model is more confident in classifying some samples as positive, so we get a fairly good precision (0.79), although it still falls short on recall (0.26) and thus on F1 score.

In the third category, "The poet", we again have two lower level labels that were never predicted in the previous lower level classification, plus the category Compassion, whose results were mixed but acceptable. Putting these three categories into one, we get improved measures overall; only the precision drops a little compared to predicting Compassion alone.

All in all, predicting the higher level labels improves our model quite a bit, although the results are still far from perfect. If grouping helps to predict labels that were previously almost never predicted, such as Presupposition and Authority Voice in "The expert", this could mean that these two labels often occur together, which would support the taxonomy in this particular point. The overall higher scores likewise speak for the taxonomy and suggest that the dataset is too small for the lower level labels.
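The "F1 Score Average" printed for the higher level labels can be reproduced from the three confusion matrices, assuming it is the unweighted (macro) mean of the per-category F1 scores. That assumption matches the printed value; `f1_from_counts` is a hypothetical helper for this sketch.

```python
# (tp, fp, fn) read off the higher level confusion matrices printed above.
counts = {
    "The saviour": (149, 27, 1),
    "The expert": (19, 5, 54),
    "The poet": (83, 29, 23),
}

def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in counts: 2*tp / (2*tp + fp + fn)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

f1_per_category = {name: f1_from_counts(*c) for name, c in counts.items()}
# Macro average: every category counts equally, regardless of its frequency.
macro_f1 = sum(f1_per_category.values()) / len(f1_per_category)

print(f1_per_category)
print(macro_f1)
```

Because every category carries the same weight in this average, a single category with an F1 score of 0 lowers the result noticeably, which is worth keeping in mind when comparing the averages for 3 and for 7 categories.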

# k-Fold Cross Validation

Finally, we want to see whether the logistic regression improves when trained with 10-fold cross validation, as has also been done by Pérez-Almendros et al. (2020). We go back to the lower level labels with 7 categories for this experiment. We use the basic KFold implementation, as StratifiedKFold cannot handle multilabel targets.
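The limitation of StratifiedKFold can be demonstrated on toy data. The arrays below are hypothetical and merely mimic the shape of our 7-column label matrix.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy data (hypothetical): 20 samples, 3 features, 7 binary labels per sample.
X_toy = np.arange(60).reshape(20, 3)
Y_toy = np.array([[1, 0, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 0, 0, 0]] * 10)

# StratifiedKFold only accepts binary or multiclass targets and rejects
# a multilabel indicator matrix once the split is actually iterated.
stratified_error = None
try:
    next(StratifiedKFold(n_splits=5).split(X_toy, Y_toy))
except ValueError as err:
    stratified_error = err
print(stratified_error)

# Plain KFold ignores the labels and simply partitions the sample indices.
kfold_toy = KFold(n_splits=5, shuffle=True, random_state=1)
fold_sizes = [len(test_idx) for _, test_idx in kfold_toy.split(X_toy)]
print(fold_sizes)
```

The price of plain KFold is that rare categories may be distributed unevenly across the folds, so individual folds can contain very few positive examples of them.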

In [25]:
from sklearn.model_selection import KFold
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state = 1)
models = []
train_scores = []
validation_scores = []
f1_scores = []
fold_no = 1

for train_index, test_index in kfold.split(X_train, Y_train):
    # multilabel logistic regression
    clf = MultiOutputClassifier(estimator= LogisticRegression(max_iter = 500)).fit(X_train[train_index], 
                                                                                   Y_train[train_index])
    models.append(clf)
    train_scores.append(clf.score(X_train[train_index], Y_train[train_index]))
    validation_scores.append(clf.score(X_train[test_index], Y_train[test_index]))
    f1_scores.append(dontpatronizeme.ext_evaluation.evaluate(Y_train[test_index], 
                                                             clf.predict(X_train[test_index]), 
                                                             'll', verbose = False))
    fold_no += 1
       
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(train_scores)):
    print('------------------------------------------------------------------------')
    print(f'> Fold {i+1}')
    print(f'\t - Mean Train Accuracy: {round(train_scores[i] *100, 4) }%')
    print(f'\t - Mean Validation Accuracy: {round(validation_scores[i] *100, 4)}%')
    print(f'\t - Mean F1 Score: {round(f1_scores[i] * 100, 4)}%')


print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Mean Train Accuracy: {round(np.mean(train_scores), 4)} (+- {round(np.std(train_scores), 4)})')
print(f'> Mean Validation Accuracy: {round(np.mean(validation_scores), 4)} (+- {round(np.std(validation_scores), 4)})')
print(f'> Mean F1 Score: {round(np.mean(f1_scores), 4)} (+- {round(np.std(f1_scores), 4)})')
print('------------------------------------------------------------------------')

------------------------------------------------------------------------
Score per fold
------------------------------------------------------------------------
> Fold 1
	 - Mean Train Accuracy: 30.4229%
	 - Mean Validation Accuracy: 18.2927%
	 - Mean F1 Score: 21.2969%
------------------------------------------------------------------------
> Fold 2
	 - Mean Train Accuracy: 30.2865%
	 - Mean Validation Accuracy: 21.9512%
	 - Mean F1 Score: 22.106%
------------------------------------------------------------------------
> Fold 3
	 - Mean Train Accuracy: 29.7408%
	 - Mean Validation Accuracy: 23.1707%
	 - Mean F1 Score: 23.0997%
------------------------------------------------------------------------
> Fold 4
	 - Mean Train Accuracy: 29.4679%
	 - Mean Validation Accuracy: 20.7317%
	 - Mean F1 Score: 22.4392%
------------------------------------------------------------------------
> Fold 5
	 - Mean Train Accuracy: 30.1501%
	 - Mean Validation Accuracy: 20.7317%
	 - Mean F1 Score: 20.6268

Seeing that the model from Fold 3 has the best F1 Score with 23%, we will now test this model with our separate test data:

In [26]:
Y_pred = models[3-1].predict(X_test)
dontpatronizeme.ext_evaluation.evaluate(Y_test, Y_pred, 'll')

Unbalanced Power Relations
Accuracy: 0.8022598870056498
Precision: 0.8068181818181818
Recall: 0.993006993006993
F1 Score: 0.890282131661442
Confusion Matrix: (tn, fp / fn, tp)
[[  0  34]
 [  1 142]]
--------------------------------------------------
Shallow Solution
Accuracy: 0.7796610169491526
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix: (tn, fp / fn, tp)
[[138   0]
 [ 39   0]]
--------------------------------------------------
Presupposition
Accuracy: 0.7457627118644068
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix: (tn, fp / fn, tp)
[[132   0]
 [ 45   0]]
--------------------------------------------------
Authority Voice
Accuracy: 0.7457627118644068
Precision: 1.0
Recall: 0.021739130434782608
F1 Score: 0.042553191489361694
Confusion Matrix: (tn, fp / fn, tp)
[[131   0]
 [ 45   1]]
--------------------------------------------------
Metaphor
Accuracy: 0.7796610169491526
Precision: 0.0
Recall: 0.0
F1 Score: 0.0
Confusion Matrix: (tn, fp / fn, tp)
[[138   0]
 [



Here, we get an average F1 score over all 7 categories of 0.21. This is slightly lower than the 0.22 average of the plain logistic regression. We also still observe similar phenomena with cross validation, including always or never predicting certain categories. This leads to the conclusion that 10-fold cross validation did not have a significant impact on the training or selection of the logistic regression models either, as the results are quite similar and did not improve.