## Problem Description

As part of the DARGMINTS project, argumentation structures were annotated in opinion articles published in the Público newspaper. The annotation included several layers:

1. Selecting text spans that are taken to have an argumentative role (either as premises or conclusions of arguments) -- these are Argumentative Discourse Units (ADUs).
2. Connecting such ADUs through support or attack relations.
3. Classifying the propositional content of ADUs as propositions of fact, propositions of value, or propositions of policy, and, within propositions of value, distinguishing between those with a positive (+) or negative (-) connotation.

In a proposition of fact, the content corresponds to a piece of information that can be checked for truth. This is usually not the case for propositions of value, which express value judgments of a strongly subjective nature; often, they also carry a (positive or negative) polarity. A proposition of policy prescribes or suggests a certain line of action, often mentioning the agents or entities capable of carrying it out.

The aim of this assignment is to build a classifier of types of ADUs, thus focusing on the last annotation step described above. For that, you have access to two different files:

- A file containing the content of each annotated ADU span and its 5-class classification: Value, Value(+), Value(-), Fact, or Policy. For each ADU, we also know the annotator and the document from which it has been taken.
- A file containing details for each opinion article that has been annotated, including the full article content.

Besides ADU contents, you can make use of any contextual information provided in the corresponding opinion article.
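
One way to attach article-level context to each ADU is to join the two tables on `article_id`. The sketch below uses tiny made-up frames that reuse column names from the real files; the `with_context` name is only illustrative.

```python
import pandas as pd

# Toy stand-ins for the two spreadsheets (values are made up for illustration).
articles = pd.DataFrame({
    "article_id": ["a1", "a2"],
    "title": ["Title one", "Title two"],
    "topics": ["Sports", "Politics"],
})
adus = pd.DataFrame({
    "article_id": ["a1", "a1", "a2"],
    "annotator": ["A", "B", "A"],
    "tokens": ["first span", "second span", "third span"],
    "label": ["Value", "Fact", "Policy"],
})

# Left join keeps every ADU row and adds the columns of its source article.
# validate= raises if an article_id appears more than once in `articles`.
with_context = adus.merge(articles, on="article_id", how="left", validate="many_to_one")
print(with_context[["tokens", "label", "topics"]])
```

Columns such as `topics` or `title` could then be used as extra features alongside the ADU text.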

Each opinion article has been annotated by 3 different annotators. For that reason, the ADU file indicates which annotator produced each ADU. The same ADU may have been annotated by more than one annotator; when that is the case, the annotators do not necessarily agree on the type of proposition.
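
A rough way to find such disagreements is to group the ADU table by span and count distinct labels. The sketch below runs on made-up rows and assumes that an ADU is identified by the pair (`article_id`, `ranges`); that identification is an assumption, not something the dataset description states.

```python
import pandas as pd

# Toy ADU table (made-up values) with the same columns as the real file.
adus = pd.DataFrame({
    "article_id": ["a1", "a1", "a1", "a2", "a2"],
    "annotator": ["A", "B", "C", "A", "B"],
    "ranges": ["[[0, 10]]", "[[0, 10]]", "[[20, 35]]", "[[5, 9]]", "[[5, 9]]"],
    "label": ["Value", "Fact", "Policy", "Value(+)", "Value(+)"],
})

# Assumption: the same (article_id, ranges) pair denotes the same ADU.
per_span = adus.groupby(["article_id", "ranges"])["label"].agg(
    n_annotations="size", n_labels="nunique"
)

# Spans annotated more than once whose annotators chose different labels.
conflicts = per_span[(per_span["n_annotations"] > 1) & (per_span["n_labels"] > 1)]
print(conflicts)
```

How conflicting spans are handled (majority vote, dropping them, or keeping every annotation) is a design decision that affects both training and evaluation.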

How good a classifier (or set of classifiers) can you get? Remember to split the dataset sensibly, so that you have a proper test set. Start by obtaining an arbitrary baseline, against which you can then compare your improvements.
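
A simple baseline could be a model that always predicts the most frequent training label. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; the `*_toy` variables and `baseline_acc` are illustrative names only.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (made up); the dummy model ignores the feature values.
y_train_toy = ["Value", "Value", "Value", "Fact", "Policy"]
y_test_toy = ["Value", "Fact", "Value", "Value(-)"]
X_train_toy = [[0]] * len(y_train_toy)
X_test_toy = [[0]] * len(y_test_toy)

# Always predict the label seen most often during fit().
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_pred = baseline.predict(X_test_toy)

baseline_acc = accuracy_score(y_test_toy, baseline_pred)
print(baseline_acc)
```

Any real classifier should beat this score; because the labels are imbalanced, per-class metrics are more informative than accuracy alone.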

Portuguese NLTK: https://www.nltk.org/howto/portuguese_en.html

### Data Analysis

In [31]:
import pandas as pd
data = pd.read_excel('OpArticles/OpArticles.xlsx')
data_ADU = pd.read_excel('OpArticles/OpArticles_ADUs.xlsx')

In [32]:
# TODO

display("Articles Data", data.head())
display("ADU Data", data_ADU.head())

'Articles Data'

Unnamed: 0,article_id,title,authors,body,meta_description,topics,keywords,publish_date,url_canonical
0,5d04a31b896a7fea069ef06f,"Pouco pão e muito circo, morte e bocejo",['José Vítor Malheiros'],"O poeta espanhol António Machado escrevia, uns...","É tudo cómico na FIFA, porque todos os dias a ...",Sports,"['Brasil', 'Campeonato do Mundo', 'Desporto', ...",2014-06-17 00:16:00,https://www.publico.pt/2014/06/17/desporto/opi...
1,5d04a3fc896a7fea069f0717,Portugal nos Mundiais de Futebol de 2010 e 2014,['Rui J. Baptista'],“O mais excelente quadro posto a uma luz logo ...,Deve ser evidenciado o clima favorável criado ...,Sports,"['Brasil', 'Campeonato do Mundo', 'Coreia do N...",2014-07-05 02:46:00,https://www.publico.pt/2014/07/05/desporto/opi...
2,5d04a455896a7fea069f07ab,"Futebol, guerra, religião",['Fernando Belo'],1. As sociedades humanas parecem ser regidas p...,O futebol parece ser um sucedâneo quer da lei ...,Sports,"['A guerra na Síria', 'Desporto', 'Futebol', '...",2014-07-12 16:05:33,https://www.publico.pt/2014/07/12/desporto/opi...
3,5d04a52f896a7fea069f0921,As razões do Qatar para acolher o Mundial em 2022,['Hamad bin Khalifa bin Ahmad Al Thani'],Este foi um Mundial incrível. Vimos actuações ...,Queremos cooperar plenamente com a investigaçã...,Sports,"['Desporto', 'FIFA', 'Futebol', 'Mundial de fu...",2014-07-27 02:00:00,https://www.publico.pt/2014/07/27/desporto/opi...
4,5d04a8d7896a7fea069f6997,A política no campo de futebol,['Carlos Nolasco'],O futebol sempre foi um jogo aparentemente sim...,Retirar a expressão política do futebol é reti...,Sports,"['Albânia', 'Campeonato da Europa', 'Desporto'...",2014-10-23 00:16:00,https://www.publico.pt/2014/10/23/desporto/opi...


'ADU Data'

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
0,5d04a31b896a7fea069ef06f,A,0,"[[2516, 2556]]",O facto não é apenas fruto da ignorância,Value
1,5d04a31b896a7fea069ef06f,A,1,"[[2568, 2806]]",havia no seu humor mais jornalismo (mais inves...,Value
2,5d04a31b896a7fea069ef06f,A,3,"[[3169, 3190]]",É tudo cómico na FIFA,Value
3,5d04a31b896a7fea069ef06f,A,4,"[[3198, 3285]]",o que todos nós permitimos que esta organizaçã...,Value
4,5d04a31b896a7fea069ef06f,A,6,"[[4257, 4296]]",não nos fazem rir à custa dos poderosos,Value


In [33]:
print("---------------------Articles---------------------")
data.info()
print("---------------------ADU---------------------")
data_ADU.info()

---------------------Articles---------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   article_id        373 non-null    object
 1   title             373 non-null    object
 2   authors           373 non-null    object
 3   body              373 non-null    object
 4   meta_description  373 non-null    object
 5   topics            373 non-null    object
 6   keywords          373 non-null    object
 7   publish_date      373 non-null    object
 8   url_canonical     373 non-null    object
dtypes: object(9)
memory usage: 26.4+ KB
---------------------ADU---------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16743 entries, 0 to 16742
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   article_id  16743 non-null  object
 1   annotator   16743 n

#### Description of tables

In [34]:
display("Articles data", data.describe(include='all'))
display("ADU data", data_ADU.describe(include='all'))

'Articles data'

Unnamed: 0,article_id,title,authors,body,meta_description,topics,keywords,publish_date,url_canonical
count,373,373,373,373,373,373,373,373,373
unique,373,373,373,373,373,8,356,372,373
top,5cf46cea896a7fea06003dd2,Como parar a charlatanice que são as medicinas...,"['Jean-Michel Casa', 'Christof Weil']","Atualmente, o excesso de peso é o principal re...",Temos de “atacar” a máquina fiscal espanhola. ...,Sports,"['Desporto', 'Opinião']",2016-08-04 00:30:00,https://www.publico.pt/2016/02/09/desporto/opi...
freq,1,1,1,1,1,52,4,2,1


'ADU data'

Unnamed: 0,article_id,annotator,node,ranges,tokens,label
count,16743,16743,16743.0,16743,16743,16743
unique,373,4,,11929,12008,5
top,5cf464b6896a7fea06ffbb9d,B,,"[[0, 138]]",Não é verdade,Value
freq,142,5226,,6,8,8102
mean,,,14.93896,,,
std,,,14.033932,,,
min,,,0.0,,,
25%,,,4.0,,,
50%,,,11.0,,,
75%,,,21.0,,,


### Preprocessing

#### Tokenization

In [35]:
def column2string(column):
    # Join all entries of a column into a single string, separated by spaces
    return " ".join(column)

In [37]:
# Portuguese nltk
import nltk.test.portuguese_en_fixt 
import nltk
import re

portuguese_tokenizer=nltk.data.load('tokenizers/punkt/portuguese.pickle')  

#### Stemming

In [71]:
import nltk.stem

corpus = []
counter = 0
ps = nltk.stem.RSLPStemmer()
for text in data_ADU['tokens']:
    # to lower-case
    text = text.lower()
    # split on whitespace and stem each word (stop words are not removed here)
    text = ' '.join(ps.stem(w) for w in text.split())
    corpus.append(text)
    counter += 1


o fact não é apen frut da ignor
hav no seu hum mais jorn (mal investigação, mais preocup em aprofund e contextual a história, mais isenç no relato, mais preocup social, mais urg de denunciar) do que em muit peç real jorn
é tud cómic na fif
o que tod nó permit que est organiz faç é total absurd e sem sent
não no faz rir à cust do poder
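
The loop above lower-cases and stems each word but keeps stop words. A minimal sketch of filtering them out is shown below. The tiny stop list and the `remove_stop_words` helper are hypothetical; in practice `nltk.corpus.stopwords.words('portuguese')` could supply the list, provided the NLTK `stopwords` corpus has been downloaded.

```python
# Hypothetical, deliberately tiny stop list, for illustration only.
STOP_WORDS = {"o", "a", "é", "da", "de", "na", "que"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case `text` and drop whitespace-separated tokens found in `stop_words`."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)

filtered = remove_stop_words("É tudo cómico na FIFA")
print(filtered)
```

Whether removing stop words helps here is an empirical question: words such as negations may carry signal for the polarity classes, so it is worth comparing results with and without the filter.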


In [69]:
print(data_ADU['tokens'].size)
print(counter)


16743
16743


#### Lemmatization

In [40]:
# TODO

#lemmatizer = LemmatizerModel.pretrained("lemma", "pt") \
#        .setInputCols(["token"]) \
#        .setOutputCol("lemma")
#nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
#light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
#results = light_pipeline.fullAnnotate(text_tokens)

### N-Gram Language Models

In [41]:
# TODO - António

### Text Classification

In [72]:
# TODO

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()



print(X.shape)

y = data_ADU['label']

print(X.shape, y.shape)


(16743, 12599)
(16743, 12599) (16743,)
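
The vectorizer above is fitted on the whole corpus before any split, so vocabulary from the future test set influences the features. One alternative, sketched here on made-up sentences, wraps the vectorizer and a classifier in a `Pipeline` so that the vocabulary is learned from the training text only. The pipeline also keeps the count matrix sparse, which avoids the memory cost of `.toarray()`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training and test texts, for illustration only.
train_texts = ["o governo deve agir", "isto é muito bom",
               "o governo deve investir", "isto é mau"]
train_labels = ["Policy", "Value", "Policy", "Value"]
test_texts = ["o governo deve reformar"]

# fit() learns the vocabulary from train_texts only; predict() reuses it.
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])
text_clf.fit(train_texts, train_labels)
pipeline_pred = text_clf.predict(test_texts)
print(pipeline_pred)
```

Models that need dense input, such as `GaussianNB`, would still require a conversion step, which is one reason to prefer estimators that accept sparse matrices for bag-of-words features.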


In [73]:
# TODO - improve division

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

print("\nLabel distribution in the training set:")
print(y_train.value_counts())

print("\nLabel distribution in the test set:")
print(y_test.value_counts())

(13394, 12599) (13394,)
(3349, 12599) (3349,)

Label distribution in the training set:
Value       6422
Fact        2967
Value(-)    2339
Value(+)    1137
Policy       529
Name: label, dtype: int64

Label distribution in the test set:
Value       1680
Fact         696
Value(-)     561
Value(+)     274
Policy       138
Name: label, dtype: int64


#### Models

In [89]:
# TODO - Model evaluation

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score

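def model_evaluation(y_pred, y_test):
  # Rows/columns of the confusion matrix and the per-class scores below follow
  # sklearn's sorted label order: Fact, Policy, Value, Value(+), Value(-).
  print("------------------Confusion Matrix------------------")
  print(confusion_matrix(y_test, y_pred))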
  print("------------------Accuracy------------------")
  print(accuracy_score(y_test, y_pred))
  print("------------------Precision------------------")
  print(precision_score(y_test, y_pred, average=None))
  print("------------------Recall------------------")
  print(recall_score(y_test, y_pred, average=None))
  print("------------------F1------------------")
  print(f1_score(y_test, y_pred, average=None))
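
The per-class arrays printed by `model_evaluation` carry no class names. As a possible complement, `classification_report` prints one named row per class, and a macro-averaged F1 gives a single number that treats all five classes equally. The labels and predictions below are made up for illustration.

```python
from sklearn.metrics import classification_report, f1_score

# Made-up gold labels and predictions, for illustration only.
LABELS = ["Fact", "Policy", "Value", "Value(+)", "Value(-)"]
y_true_toy = ["Fact", "Policy", "Value", "Value", "Value(+)", "Value(-)"]
y_pred_toy = ["Fact", "Value", "Value", "Value", "Value(+)", "Value"]

# Per-class precision/recall/F1 with the class name on each row.
print(classification_report(y_true_toy, y_pred_toy, labels=LABELS, zero_division=0))

# Macro-F1 averages the per-class F1 scores without weighting by class size.
macro_f1 = f1_score(y_true_toy, y_pred_toy, labels=LABELS,
                    average="macro", zero_division=0)
print(round(macro_f1, 3))
```

With a dominant `Value` class, a model can reach moderate accuracy while scoring near zero on minority classes; macro-F1 makes that visible.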



In [115]:
# Note: this definition shadows IPython's built-in display() used in earlier cells.
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean, std, param in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {param}')

In [None]:
from sklearn.model_selection import cross_val_score

def cross_validation(model, kfold, X, y):
  return cross_val_score(model, X, y, cv=kfold)
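
A possible way to call this helper is with a `StratifiedKFold` object, which keeps the label proportions roughly equal across folds. The sketch below repeats the helper so that it runs on its own and uses made-up count features.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

def cross_validation(model, kfold, X, y):
    # Same helper as above, repeated so this sketch is self-contained.
    return cross_val_score(model, X, y, cv=kfold)

# Made-up count features and labels, for illustration only.
X_toy = np.array([[2, 0], [3, 0], [2, 1], [0, 2], [0, 3], [1, 2]])
y_toy = np.array(["Policy", "Policy", "Policy", "Value", "Value", "Value"])

# Three stratified folds: each test fold holds one sample of each class.
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_validation(MultinomialNB(), kfold, X_toy, y_toy)
print(scores.mean(), scores.std())
```

Reporting the mean together with the standard deviation across folds gives a sense of how stable a model's score is.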

In [85]:
# TODO - Naive Bayes

from sklearn.naive_bayes import MultinomialNB

clfNB = MultinomialNB()
clfNB.fit(X_train, y_train)

y_pred = clfNB.predict(X_test)
print(y_pred)

model_evaluation(y_pred, y_test)


['Value' 'Value' 'Value(-)' ... 'Value' 'Value' 'Fact']


In [95]:
from sklearn.naive_bayes import GaussianNB

clfGNB = GaussianNB()
clfGNB.fit(X_train, y_train)

y_pred = clfGNB.predict(X_test)
print(y_pred)

model_evaluation(y_pred, y_test)

['Value(+)' 'Policy' 'Value(+)' ... 'Fact' 'Fact' 'Value']
------------------Confusion Matrix------------------
[[229  71 131 109 156]
 [  4  78  28  13  15]
 [305 251 452 283 389]
 [ 36  49  45 120  24]
 [ 57  41 109  67 287]]
------------------Accuracy------------------
0.34816363093460734
------------------Precision------------------
[0.36291601 0.15918367 0.59084967 0.2027027  0.32950631]
------------------Recall------------------
[0.32902299 0.56521739 0.26904762 0.4379562  0.51158645]
------------------F1------------------
[0.34513941 0.24840764 0.36973415 0.27713626 0.40083799]


In [109]:
# TODO - Logistic Regression
# scikit-learn applies L2 regularization by default; here elastic net (l1_ratio=0.5) is used instead

from sklearn.linear_model import LogisticRegression

clfLR = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)

clfLR.fit(X_train, y_train)

y_pred = clfLR.predict(X_test)
print(y_pred)

model_evaluation(y_pred, y_test)




['Fact' 'Value' 'Value' ... 'Value' 'Value' 'Fact']
------------------Confusion Matrix------------------
[[ 239    4  395   13   45]
 [   8   33   87    2    8]
 [ 228   23 1250   53  126]
 [  30    8  157   68   11]
 [  54    7  325    5  170]]
------------------Accuracy------------------
0.5255300089578979
------------------Precision------------------
[0.42754919 0.44       0.56458898 0.4822695  0.47222222]
------------------Recall------------------
[0.3433908  0.23913043 0.74404762 0.24817518 0.3030303 ]
------------------F1------------------
[0.38087649 0.30985915 0.64201335 0.32771084 0.36916395]


In [119]:
# TODO - Decision Tree

from sklearn.tree import DecisionTreeClassifier

clfDT = DecisionTreeClassifier(max_depth=5)
#parameters= {
  #"criterion":['gini', 'entropy'],
  #"splitter":['best', 'random'],
  #"max_depth": [2,3,4,5, None]
#}

#cv = GridSearchCV(clfDT,parameters,cv=5)
#cv.fit(X_train, y_train.ravel())

#display(cv)

clfDT.fit(X_train, y_train)

y_pred = clfDT.predict(X_test)
print(y_pred)

model_evaluation(y_pred, y_test)

['Value' 'Value' 'Value' ... 'Value' 'Value' 'Fact']
------------------Confusion Matrix------------------
[[  24    0  670    0    2]
 [   0   13  125    0    0]
 [  13   14 1649    0    4]
 [   1    0  271    1    1]
 [  10    2  547    1    1]]
------------------Accuracy------------------
0.5040310540459839
------------------Precision------------------
[0.5        0.44827586 0.50551809 0.5        0.125     ]
------------------Recall------------------
[0.03448276 0.0942029  0.98154762 0.00364964 0.00178253]
------------------F1------------------
[0.06451613 0.15568862 0.66734116 0.00724638 0.00351494]


In [117]:
# TODO - Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clfRF = RandomForestClassifier(n_estimators=250)
#parameters= {
   #"n_estimators":[5,10,50,100,250],
   #"max_depth":[2,4,8,16,32,None]
#}

#cv = GridSearchCV(clfRF,parameters,cv=5)
#cv.fit(X_train, y_train.ravel())

#display(cv)

clfRF.fit(X_train, y_train)

y_pred = clfRF.predict(X_test)
print(y_pred)

model_evaluation(y_pred, y_test)

['Value' 'Value' 'Value' ... 'Value' 'Value' 'Fact']
------------------Confusion Matrix------------------
[[ 210    3  424   18   41]
 [   1   49   84    2    2]
 [ 164   31 1330   53  102]
 [  27    3  179   65    0]
 [  44    0  329    2  186]]
------------------Accuracy------------------
0.5494177366378024
------------------Precision------------------
[0.47085202 0.56976744 0.56692242 0.46428571 0.56193353]
------------------Recall------------------
[0.30172414 0.35507246 0.79166667 0.23722628 0.3315508 ]
------------------F1------------------
[0.36777583 0.4375     0.66070541 0.31400966 0.41704036]


In [116]:
display(cv)

Best parameters are: {'max_depth': None, 'n_estimators': 250}


0.479 + or -0.0 for the {'max_depth': 2, 'n_estimators': 5}
0.479 + or -0.0 for the {'max_depth': 2, 'n_estimators': 10}
0.479 + or -0.0 for the {'max_depth': 2, 'n_estimators': 50}
0.479 + or -0.0 for the {'max_depth': 2, 'n_estimators': 100}
0.479 + or -0.0 for the {'max_depth': 2, 'n_estimators': 250}
0.48 + or -0.0 for the {'max_depth': 4, 'n_estimators': 5}
0.479 + or -0.0 for the {'max_depth': 4, 'n_estimators': 10}
0.479 + or -0.0 for the {'max_depth': 4, 'n_estimators': 50}
0.479 + or -0.0 for the {'max_depth': 4, 'n_estimators': 100}
0.479 + or -0.0 for the {'max_depth': 4, 'n_estimators': 250}
0.48 + or -0.0 for the {'max_depth': 8, 'n_estimators': 5}
0.48 + or -0.001 for the {'max_depth': 8, 'n_estimators': 10}
0.479 + or -0.0 for the {'max_depth': 8, 'n_estimators': 50}
0.48 + or -0.0 for the {'max_depth': 8, 'n_estimators': 100}
0.479 + or -0.0 for the {'max_depth': 8, 'n_estimators': 250}
0.482 + or -0.003 fo

In [112]:
# TODO - SVM 

from sklearn import svm

clfSVC = svm.SVC(kernel="sigmoid")

clfSVC.fit(X_train, y_train)

y_pred = clfSVC.predict(X_test)
print(y_pred)

model_evaluation(y_pred, y_test)

['Fact' 'Fact' 'Value' ... 'Value' 'Value(-)' 'Fact']
------------------Confusion Matrix------------------
[[ 267    2  351    2   74]
 [  30    8   86    0   14]
 [ 445   10 1010    2  213]
 [  80    0  157    7   30]
 [ 179    2  278    0  102]]
------------------Accuracy------------------
0.41624365482233505
------------------Precision------------------
[0.26673327 0.36363636 0.53666312 0.63636364 0.23556582]
------------------Recall------------------
[0.38362069 0.05797101 0.60119048 0.02554745 0.18181818]
------------------F1------------------
[0.31467295 0.1        0.56709714 0.04912281 0.20523139]


### Regularization

In [None]:
# TODO - Leonor

