## Experimental Setting 2: Features + rbf SVM 
### Task 1: Classifying Argument (contains either a Claim or a Premise) vs non-Argument

**Importing necessary packages**

In [56]:
import pandas as pd
import numpy as np
import string
import collections
import spacy
nlp = spacy.load('en_core_web_sm')
from tensorflow.keras.utils import to_categorical
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package punkt to /Users/mariap/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/mariap/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
filename = '../../../data/sentence_db_candidate.csv'
df = pd.read_csv(filename)

**Simple preprocessing of sentences with lowercasing and punctuation removal**

In [3]:
def preproc(sentence):
    sentence = sentence.lower()
    sentence = ''.join([i for i in sentence if i not in string.punctuation])
    return sentence

In [4]:
df['Speech'] = df['Speech'].apply(preproc)
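
To make the effect of this preprocessing concrete, here is a minimal sketch of an equivalent function built on `str.translate`. The name `preproc_translate` and the sample sentence are assumptions for illustration, not part of the original pipeline.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def preproc_translate(sentence):
    """Lowercase a sentence and strip ASCII punctuation (hypothetical variant of preproc)."""
    return sentence.lower().translate(_PUNCT_TABLE)

# Made-up sample sentence, only for demonstration
sample = "It's a very important event, and after 9/11 it became clear!"
print(preproc_translate(sample))
# its a very important event and after 911 it became clear
```

Note that apostrophes and slashes are removed without inserting a space, which is why forms such as `its` and `911` appear in the preprocessed sentences.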

In [5]:
valid = ['Claim', 'Premise', 'O']
df = df.loc[(df['Component'].isin(valid))]

**Below we convert the labels into binary classes: 1 for Claims and Premises (argument) and 0 for `O` (non-argument)**

In [6]:
classes = []

for s in df.Component:
    if s == 'O':
        classes.append(0.0)
    else:
        classes.append(1.0)

In [7]:
df['Annotation'] = classes
df['Annotation'].value_counts()

1.0    22280
0.0     7252
Name: Annotation, dtype: int64

In [8]:
df = df[['Speech', 'Annotation', 'Set']]

In [9]:
df.head()

Unnamed: 0,Speech,Annotation,Set
0,gwen i want to thank you and i want to thank ...,0.0,TRAIN
1,its a very important event and theyve done a s...,0.0,TRAIN
2,its important to look at all of our developmen...,0.0,TRAIN
3,and after 911 it became clear that we had to d...,1.0,TRAIN
4,and we also then finally had to stand up democ...,1.0,TRAIN


### Part 1: Feature Engineering

**Now we start feature engineering. Each function takes the full dataframe and a column of sentences and returns the dataframe with the new feature(s) added. Please refer to the `linguistic_features.py` file to see or copy the documented feature functions**

### *Features: Part-of-Speech (adverbs and adjectives)*

In [10]:
def list_count(lst):
    
    """
    :function: count the elements of a list -- the number of words with a respective POS or NER labels in a sentence. 
    :input: lst: list of tuples, where tuple has two elements -- a word and its POS or NER label
    :return: lst_count: list of dictionaries, where
    the dictionary consists of keys -- the elements are words and their POS or NER labels
    and values -- how many times each word and its POS or NER label occurs
    If a sentence has no POS or NER labels, return an empty list 

    """
    
    dic_counter = collections.Counter()
    
    for x in lst:
        dic_counter[x] += 1
    
    dic_counter = collections.OrderedDict( 
                     sorted(dic_counter.items(), 
                     key=lambda x: x[1], reverse=True))
    
    lst_count = [{key:value} for key,value in dic_counter.items()]
    
    return lst_count

In [11]:
def column_tag(lst_dics_tuples, tag):
    
    """
    :function: new column for each POS or NER tag category 
    :input: lst_dics_tuples: list of dictionaries with tuples 
            tag: POS or NER label from a list
    :return: tag: new column for each POS or NER label with their counts

    """
    
    if len(lst_dics_tuples) > 0:
        dic_counter = collections.Counter()

        for dic_tuples in lst_dics_tuples:
            # each key is a (word, label) pair; the value is how often it occurs
            for (_, label), n in dic_tuples.items():
                dic_counter[label] += n
        return dic_counter[tag]

    else:
        return 0
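
The two helpers above boil down to a two-step count. The sketch below walks through it on a hand-written list of `(word, tag)` pairs that stands in for spaCy output; the pairs are an assumed example, not real tagger output.

```python
import collections

# Hand-written (word, tag) pairs standing in for spaCy output (assumed example)
tagged = [('we', 'PRON'), ('really', 'ADV'), ('need', 'VERB'),
          ('really', 'ADV'), ('strong', 'ADJ'), ('allies', 'NOUN')]

# Step 1 (the job of list_count): count identical (word, tag) pairs
pair_counts = collections.Counter(tagged)

# Step 2 (the job of column_tag): sum the pair counts per tag
tag_counts = collections.Counter()
for (_, tag), n in pair_counts.items():
    tag_counts[tag] += n

# A Counter returns 0 for tags that never occur in the sentence
print(tag_counts['ADV'], tag_counts['ADJ'], tag_counts['NUM'])
# 2 1 0
```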

In [12]:
def pos_features (df, speech_sents):
    
    """
    :function: add two new columns with two POS: adjectives and adverbs, and their counts per sentence.
    Two helper functions -- list_count, column_tag -- are needed 
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: new DataFrame with two new features

    """
    
    
    df['pos'] = speech_sents.apply(lambda x: [(tag.text, tag.pos_) 
                                for tag in nlp(x)])
    
    df['pos'] = df['pos'].apply(lambda x: list_count(x))
    
    #extract features
    tags_set = []

    for lst in df['pos'].tolist():
        for dic in lst:
            for k in dic.keys():
                tags_set.append(k[1])
            
    tags_set = list(set(tags_set))

    for feature in tags_set:
        df['pos_' + feature] = df['pos'].apply(lambda x: column_tag(x, feature))
        
    # keeping only adverbs and adjectives and dropping other pos
    for feature in df.columns:
        if feature != 'pos_ADV' and feature != 'pos_ADJ' and 'pos' in feature:
            df = df.drop(feature, axis=1)
    
    return df

In [13]:
df = pos_features(df, df['Speech'])

### *Features: Named Entity Recognition labels*

In [14]:
def ner_features(df, speech_sents):
    
    """
    :function: add several new columns with NER labels, and their counts per sentence.
    Two helper functions -- list_count, column_tag -- are needed 
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: new DataFrame with new features for each NER label

    """
    
    df['ner'] = speech_sents.apply(lambda x: [(tag.text, tag.label_) 
                                for tag in nlp(x).ents])
    # count tags
    df['ner'] = df['ner'].apply(lambda x: list_count(x))
    
    # extract features
    tags_set = []

    for lst in df['ner'].tolist():
        for dic in lst:
            for k in dic.keys():
                tags_set.append(k[1])
            
    tags_set = list(set(tags_set))

    for feature in tags_set:
        df['ner_' + feature] = df['ner'].apply(lambda x: column_tag(x, feature))
        
    df = df.drop(['ner'], axis=1)
    
    return df 

In [15]:
df = ner_features(df, df['Speech'])

### *Features: Verbs' tenses and modals*

In [16]:
def verbs_features (df, speech_sents):
    
    """
    :function: add several new columns with features for verb tenses and the presence of modal verbs, 
    and their counts per sentence.
    Two helper functions -- list_count, column_tag -- are needed 
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: new DataFrame with features for each verb tense and for modal verbs

    """
    
    df['verb_tag'] = speech_sents.apply(lambda x: [(tag.text, tag.tag_) 
                                for tag in nlp(x)])
    
    df['verb_tag'] = df['verb_tag'].apply(lambda x: list_count(x))
    
    #extract features
    verbs_set = []

    for lst in df['verb_tag'].tolist():
        for dic in lst:
            for k in dic.keys():
                verbs_set.append(k[1])
            
    verbs_set = list(set(verbs_set))

    for feature in verbs_set:
        df['verb_tag_' + feature] = df['verb_tag'].apply(lambda x: column_tag(x, feature))
    
    #out of all detailed POS tags, keeping only verbs-related ones 
    keep = {'verb_tag_VB', 'verb_tag_VBZ', 'verb_tag_VBP', 'verb_tag_VBD',
            'verb_tag_VBN', 'verb_tag_VBG', 'verb_tag_MD'}
    for f in df.columns:
        if f not in keep and 'verb_tag' in f:
            df = df.drop(f, axis=1)
    
    
    return df

In [17]:
df = verbs_features(df, df['Speech'])

### *Features: Syntactic features (the number of productions, the number of Verbal Phrases groups, the depth of a sentence tree)*

In [18]:
def syntactic_features(df, speech_sents):
        
    """
    :function: add syntactic features -- 1) the number of productions, 2) the number of VP groups per sentence, 
    and 3) the depth of a sentence tree 
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: new DataFrame with three syntactic features 

    """
  
    # build the chunk grammar once instead of once per sentence
    chunker = RegexpParser(r"""
        NBAR:
        {<NN.*|JJ>*<NN.*>}  
        VP:
        {<V.*>}  
        NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  
    """)

    productions_count, tree_depth, vp_count = [], [], []
    for sent in speech_sents:
        # parse each sentence once and reuse the tree
        tree = chunker.parse(pos_tag(word_tokenize(sent)))
        productions = tree.productions()
        productions_count.append(len(productions))
        tree_depth.append(tree.height())
        # a VP group is a production whose left-hand side is VP
        vp_count.append(sum(str(p).startswith('VP') for p in productions))

    df.loc[:, 'Productions_count'] = productions_count
    df.loc[:, 'Tree_depth'] = tree_depth
    df.loc[:, 'VP_count'] = vp_count
    
    return df

In [21]:
df = syntactic_features(df, df['Speech'])





### *Feature: Sentiment of a sentence*

In [22]:
def add_sentiment (df, speech_sents): 
    
    
    """ 
    :function: add a feature with a sentiment score for each sentence 
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: DataFrame with a new feature Sentiment
    
    """
    
    analyzer = SentimentIntensityAnalyzer()

    senti = []
    
    for sent in speech_sents:
        vs = analyzer.polarity_scores(sent)
        # 'compound' is VADER's overall score (the fourth entry of the returned dict)
        senti.append(vs['compound'])
    
    df['Sentiment'] = senti
    
    return df 

In [23]:
df = add_sentiment(df, df['Speech'])

### *Feature: Discourse connectives*

In [24]:
def add_connectives (df, speech_sents):

    """ 
    :function: add a boolean feature based on presence/absence of a claim connective from the pre-defined list 
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: DataFrame with a new feature Claim_Connective
    
    """
    
    connectives = ['so that', 'as a result', 'therefore', 'thus', 'thereby', 'in the end', 'hence', 'accordingly', 'in this way', 'because', 'now that', 'insofar as', 'given that', 'in response to', 'consequently', 'as a consequence']
    lst = []
    
    for sent in speech_sents:
        if any(w in sent for w in connectives):
            lst.append(1)
        else:
            lst.append(0)
    df['Claim_Connective'] = lst
    
    return df

In [25]:
df = add_connectives(df, df['Speech'])
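
The check above is a plain substring test, so a short connective such as `thus` also fires inside a longer word such as `enthusiasm`. If that is not intended, a word-boundary regular expression is one possible alternative. The sketch below is an assumption about how this could be done, not the authors' method; it uses only a subset of the connective list, and `has_connective` is a hypothetical helper name.

```python
import re

# Subset of the notebook's connective list, for illustration only
connectives = ['so that', 'as a result', 'therefore', 'thus', 'hence', 'because']

# \b anchors each connective at word boundaries,
# so 'thus' no longer matches inside 'enthusiasm'
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(c) for c in connectives) + r')\b')

def has_connective(sentence):
    """Return 1 if the sentence contains a connective as a whole word or phrase, else 0."""
    return int(pattern.search(sentence) is not None)

print(has_connective('we did it with great enthusiasm'))  # 0
print(has_connective('thus we had to act'))               # 1
```

Switching to this matching would change the `Claim_Connective` values, so the reported scores would need to be recomputed.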

### *Features: Personal pronouns*

In [26]:
def add_personal(df, speech_sents):
    """
    :function: add two boolean features based on the presence/absence of any pronoun from two given lists.
    :input: df: entire DataFrame
            speech_sents: Series of sentences in DataFrame
    :return: df: DataFrame with two new features Pronoun_Singular and Pronoun_Plural

    """

    singular = [' i ', ' me ', ' my ', ' myself ', ' mine ']
    plural = [' we ', ' our ', ' ours ', ' ourselves ']
    lst_sing = []
    lst_plur = []

    for sent in speech_sents:
        if any(w in sent for w in singular):
            lst_sing.append(1)
        else:
            lst_sing.append(0)
    df['Pronoun_Singular'] = lst_sing

    for sent in speech_sents:
        if any(w in sent for w in plural):
            lst_plur.append(1)
        else:
            lst_plur.append(0)
    df['Pronoun_Plural'] = lst_plur

    return df

In [27]:
df = add_personal(df, df['Speech'])
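
Because the pronouns above are padded with spaces, a pronoun at the very start or end of a sentence (for example `i want to thank you`) is not detected. One possible alternative, sketched here as an assumption rather than the original method, is to compare whitespace-separated tokens against sets. `pronoun_flags` is a hypothetical helper name; the pronoun lists are copied from the function above.

```python
# Pronoun lists copied from add_personal, without the padding spaces
singular = {'i', 'me', 'my', 'myself', 'mine'}
plural = {'we', 'our', 'ours', 'ourselves'}

def pronoun_flags(sentence):
    """Return (singular_flag, plural_flag) based on whole-token matches."""
    tokens = set(sentence.split())
    return int(bool(tokens & singular)), int(bool(tokens & plural))

print(pronoun_flags('i want to thank you'))  # (1, 0)
print(pronoun_flags('this is our plan'))     # (0, 1)

# The space-padded substring test misses the sentence-initial pronoun
print(' i ' in 'i want to thank you')        # False
```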

### Tf-idf uni-, bi- and tri-grams and training preparations

**Splitting the data into three sets. Our sets are identical to those of the authors**

In [28]:
df_train = df[df['Set'] == 'TRAIN']
df_val = df[df['Set'] == 'VALIDATION']
df_test = df[df['Set'] == 'TEST']

In [29]:
print(df_train.shape)
print(df_val.shape)
print(df_test.shape)

(14044, 37)
(7033, 37)
(8455, 37)



In [30]:
df_train = df_train.drop(['Set'], axis=1)
df_val = df_val.drop(['Set'], axis=1)
df_test = df_test.drop(['Set'], axis=1)

**Separating features set and target variable set**

In [50]:
X_train = df_train.drop(['Annotation'], axis=1)
y_train = df_train.Annotation

X_val = df_val.drop(['Annotation'], axis=1)
y_val = df_val.Annotation

X_test = df_test.drop(['Annotation'], axis=1)
y_test = df_test.Annotation

**Initializing the tf-idf feature matrix. We fit and transform the sentences on the train set and only transform the validation and test sets. We use unigrams as well as bigrams and trigrams, set by the parameter `ngram_range`. Following the authors' practice, we keep only the 10,000 most frequent n-grams in the vocabulary (`max_features=10000`)**

In [33]:
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1,3))

#tf-idf
train_vecs =  vectorizer.fit_transform(X_train['Speech'])
val_tfidf = vectorizer.transform(X_val['Speech'])
test_vecs = vectorizer.transform(X_test['Speech'])
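
As a small illustration of what `ngram_range=(1,3)` and `max_features` do, the sketch below fits a vectorizer on two made-up sentences. The toy documents and the cap of 5 are assumptions chosen only to keep the output readable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences, already lowercased and without punctuation
docs = ['we had to act', 'we had to stand up']

# Without a cap, the vocabulary holds every uni-, bi- and trigram
vec = TfidfVectorizer(ngram_range=(1, 3))
vec.fit(docs)
print(sorted(vec.vocabulary_))

# With max_features, only the most frequent n-grams across the corpus are kept
capped = TfidfVectorizer(ngram_range=(1, 3), max_features=5).fit(docs)
print(len(capped.vocabulary_))
```

The vocabulary is learned from the data passed to `fit` only, which is why the validation and test sentences are transformed with the already fitted vectorizer.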

**Transformations of tf-idf matrix for further concatenation with other features**

In [34]:
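# note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
names = vectorizer.get_feature_names()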
dense = train_vecs.todense()
denselist = dense.tolist()
fe = pd.DataFrame(denselist, columns = names)

In [35]:
X_train = X_train.drop(['Speech'], axis=1)

In [36]:
train_features = np.hstack([X_train, fe])
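
Converting the tf-idf matrix to a dense array with 10,000 columns uses a lot of memory. Since `SVC` also accepts sparse input, one possible alternative is to stack the hand-crafted features and the tf-idf matrix with `scipy.sparse.hstack`. The sketch below uses made-up sentences and feature values and is not the pipeline that produced the results in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences and two made-up hand-crafted feature columns
sentences = ['we had to act', 'it became clear because we acted']
handcrafted = np.array([[0, 1], [1, 1]], dtype=float)

tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(sentences)

# Sparse concatenation: hand-crafted columns first, tf-idf columns after,
# the same column order as np.hstack([X_train, fe]) in the notebook
combined = hstack([csr_matrix(handcrafted), tfidf]).tocsr()
print(combined.shape)

# Sanity check against the dense concatenation
dense_version = np.hstack([handcrafted, tfidf.toarray()])
print(np.allclose(combined.toarray(), dense_version))
```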

In [37]:
#testing preparations
names = vectorizer.get_feature_names()
dense = test_vecs.todense()
denselist = dense.tolist()
fe = pd.DataFrame(denselist, columns = names)

In [38]:
X_test = X_test.drop(['Speech'], axis=1)

In [39]:
test_features = np.hstack([X_test, fe])

### Part 2: Replication

**First, we replicate the authors' setting: an `rbf` kernel with penalty parameter `C=10`**

In [42]:
svm = SVC(kernel='rbf', C=10, random_state=42)
svm.fit(train_features, y_train)

SVC(C=10, random_state=42)

In [43]:
y_pred_test_svm = svm.predict(test_features)

**class 1 stands for ARGUMENT, class 0 stands for NONE**

In [44]:
target_names = ['class 0', 'class 1']
print(classification_report(y_test, y_pred_test_svm, target_names=target_names, digits=3))

              precision    recall  f1-score   support

     class 0      0.807     0.265     0.399      1880
     class 1      0.824     0.982     0.896      6575

    accuracy                          0.822      8455
   macro avg      0.815     0.623     0.647      8455
weighted avg      0.820     0.822     0.785      8455





### Part 3: Hyperparameter tuning

**Our results are close to the authors', but not as high. We perform hyperparameter tuning to try to improve the scores.**

In [45]:
#validation set preparation
names = vectorizer.get_feature_names()
dense = val_tfidf.todense()
denselist = dense.tolist()
fe = pd.DataFrame(denselist, columns = names)

In [52]:
X_val = X_val.drop(['Speech'], axis=1)

In [53]:
val_features = np.hstack([X_val, fe])

**First, we initialize the parameter grid. We tune the parameters `C` and `gamma` with a grid search that runs 5-fold cross-validation on the validation set**

In [58]:
param_grid = {'C': [1, 10], 'gamma': [1, 0.1]}

In [59]:
grid = GridSearchCV(SVC(kernel='rbf'), param_grid, refit=True, verbose=2)
grid.fit(val_features, y_val)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] C=1, gamma=1 ....................................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..................................... C=1, gamma=1, total= 9.0min
[CV] C=1, gamma=1 ....................................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  9.0min remaining:    0.0s


[CV] ..................................... C=1, gamma=1, total= 9.1min
[CV] C=1, gamma=1 ....................................................
[CV] ..................................... C=1, gamma=1, total= 8.3min
[CV] C=1, gamma=1 ....................................................
[CV] ..................................... C=1, gamma=1, total= 8.2min
[CV] C=1, gamma=1 ....................................................
[CV] ..................................... C=1, gamma=1, total= 8.2min
[CV] C=1, gamma=0.1 ..................................................
[CV] ................................... C=1, gamma=0.1, total= 6.8min
[CV] C=1, gamma=0.1 ..................................................
[CV] ................................... C=1, gamma=0.1, total= 7.8min
[CV] C=1, gamma=0.1 ..................................................
[CV] ................................... C=1, gamma=0.1, total= 8.0min
[CV] C=1, gamma=0.1 ..................................................
[CV] .

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed: 167.7min finished


GridSearchCV(estimator=SVC(), param_grid={'C': [1, 10], 'gamma': [1, 0.1]},
             verbose=2)

In [60]:
grid.best_params_

{'C': 1, 'gamma': 0.1}
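
The search above cross-validates within the validation set only. A possible alternative, sketched here on random toy data as an assumption and not as the procedure used in this notebook, is to train each candidate on the train set and score it once on the validation set with `PredefinedSplit`. All array names and sizes below are made up.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

# Random toy data standing in for the train and validation feature matrices
rng = np.random.RandomState(42)
X_tr, y_tr = rng.rand(40, 5), np.tile([0, 1], 20)
X_va, y_va = rng.rand(20, 5), np.tile([0, 1], 10)

X_all = np.vstack([X_tr, X_va])
y_all = np.concatenate([y_tr, y_va])

# -1 marks rows that are always used for training; 0 marks the single held-out fold
test_fold = np.concatenate([np.full(len(X_tr), -1), np.zeros(len(X_va), dtype=int)])

search = GridSearchCV(SVC(kernel='rbf'),
                      {'C': [1, 10], 'gamma': [1, 0.1]},
                      cv=PredefinedSplit(test_fold),
                      refit=False)  # no final refit on train + validation
search.fit(X_all, y_all)
print(search.best_params_)
```

With a single predefined fold, each candidate is fitted once instead of five times.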


### Part 4: Training and testing the model with the best parameters; a linear SVM

**As the tuning above shows, the best parameters are `C=1` and `gamma=0.1`. Now we shall train and test the model with this parameter setting**

In [61]:
svm_best = SVC(kernel='rbf', C=1, gamma=0.1, random_state=42)

In [62]:
svm_best.fit(train_features, y_train)

SVC(C=1, gamma=0.1, random_state=42)

In [63]:
y_pred_test_svm_best = svm_best.predict(test_features)

In [64]:
target_names = ['class 0', 'class 1']
print(classification_report(y_test, y_pred_test_svm_best, target_names=target_names, digits=3))

              precision    recall  f1-score   support

     class 0      0.757     0.263     0.391      1880
     class 1      0.822     0.976     0.893      6575

    accuracy                          0.817      8455
   macro avg      0.790     0.620     0.642      8455
weighted avg      0.808     0.817     0.781      8455




**Unfortunately, the parameters selected on the validation set did not improve performance on the test set compared to the original model above (weighted-average f1-score 0.781 vs 0.785). To still try to improve the scores, we try a `linear` kernel with the default `C=1`; `gamma` is not used by the linear kernel.**

In [65]:
svm_lin = SVC(kernel='linear', C=1, random_state=42)

In [66]:
svm_lin.fit(train_features, y_train)

SVC(C=1, kernel='linear', random_state=42)

In [67]:
y_pred_test_svm_lin = svm_lin.predict(test_features)

In [68]:
target_names = ['class 0', 'class 1']
print(classification_report(y_test, y_pred_test_svm_lin, target_names=target_names, digits=3))

              precision    recall  f1-score   support

     class 0      0.700     0.387     0.498      1880
     class 1      0.845     0.953     0.895      6575

    accuracy                          0.827      8455
   macro avg      0.772     0.670     0.697      8455
weighted avg      0.812     0.827     0.807      8455





**Conclusion: With the linear kernel, we got closer to the original results. Our weighted-average f1-score is 0.807, while the authors report 0.823. For the non-argumentative class, our f1-score is higher: 0.498 vs 0.433.**
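
The averages in the classification report can be recomputed from the per-class rows. The short sketch below does this for the linear-kernel report, using the rounded f1-scores and supports printed above; small differences in the last digit can come from that rounding.

```python
# Per-class f1-scores and supports copied from the linear-kernel report
f1 = {'class 0': 0.498, 'class 1': 0.895}
support = {'class 0': 1880, 'class 1': 6575}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(f"macro f1:    {macro_f1:.4f}")
print(f"weighted f1: {weighted_f1:.4f}")
```

Because class 1 has about 3.5 times the support of class 0, the weighted average sits close to the class 1 score, while the macro average (0.697 in the report) reflects the weaker performance on the non-argumentative class more directly.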