***Description***

<div> This notebook extracts three features from text in the articles: sentiment (VADER; Hutto & Gilbert 2014), (Stanford POS tagger; Toutanova et al. 2003), argumentation features (AlKhatib et al. 2016; Alhindi et al. 2020), on sentence level.

<div> It consists of helper functions for importing data, importing argmentation feature detecting models (hence, ArgFeat models), extracting features.

In [None]:
#!pip install nltk

In [8]:
import nltk
import numpy as np
import os
import pandas as pd
import torch

from collections import Counter
from glob import glob
from nltk import word_tokenize, StanfordTagger
from nltk.data import load
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn import preprocessing
from transformers import BertTokenizerFast, BertForSequenceClassification

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('tagsets')

# import VADER
sid = SentimentIntensityAnalyzer()

# import POS tag_dict and label encoder
tagdict = load('help/tagsets/upenn_tagset.pickle')
le = preprocessing.LabelEncoder()
le.fit(list(tagdict.keys()))

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/users/rldall/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to
[nltk_data]     /home/users/rldall/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/users/rldall/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package tagsets to
[nltk_data]     /home/users/rldall/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [2]:
# Show list of all POS tags and their descriptions
print(nltk.help.upenn_tagset())

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

# Helper functions

In [3]:
# Helper functions to import data

def select_files(path, startwith):
    list_of_files = []
    files = os.listdir(path)
    for file in files:
        if file.startswith(startwith):
            list_of_files.append(str(path)+str(file))
    return list_of_files

def transform_input(file):
    # import data
    df = pd.read_csv(file, sep='\t', header=None)
    text_series = df[1]
    text_token = []
    # tokenize sentences
    for t in text_series:
        sent_token = sent_tokenize(t)
        text_token.append(sent_token)
    # new column
    df[2] = text_token
    # change labels
    df[0] = df[0].map({'news':0 , 'editorial':1})
    return df

In [4]:
# Helper functions to extract VADER sentiment

def format_sent(compound_score):
    polarity = 0
    if(compound_score>= 0.05):
        polarity = 1
    elif(compound_score<= -0.05):
        polarity = -1
    return polarity

def get_scores(text):
    scores = sid.polarity_scores(text)
    return np.array([scores.get(s) for s in scores])

def get_sentiment(df,series_col,df_idx):
    series = df[series_col]
    error_list = []
    compound_list = []
    sum_list = []
    
#    for article in series:
    for idx in range(len(series)):
        article = series.iloc[idx]       
        try:
            scores = [get_scores(text) for text in article]
            compound_list.append([s[-1] for s in scores])        
            sum_list.append([format_sent(s[-1]) for s in scores])
            
        except:
            print('Error line:',idx)
            error_list.append(idx)

    # new column
    df['sent_compound'] = compound_list
    df['sent_sum'] = sum_list
    
    df = df.drop(error_list)
#    df.to_csv(list_of_files[df_idx],sep='\t',header=False,index=False)
    print('Saved\t', list_of_files[df_idx].split('/')[-1])
    
    return df

In [5]:
# Helper functions to extract POS tags

def predict_pos(text):
    text_tok = nltk.word_tokenize(text)
    return [word_class for word, word_class in nltk.pos_tag(text_tok)]

def get_pos(df,series_col,df_idx):
    series = df[series_col]
    error_list = []
    pos_list = []
    for idx in range(len(series)):
        article = series.iloc[idx]
        try:
            article_pos = [predict_pos(sent) for sent in article]
            pos_list.append(article_pos)
        except:
            print('Error line:',idx)
            error_list.append(idx)
    # new column
    df['pos'] = pos_list
    df = df.drop(error_list)
#    df.to_csv(list_of_files[df_idx],sep='\t',header=False,index=False)
    print('Saved\t', list_of_files[df_idx].split('/')[-1])
    return df

# pad and buils np.array for each sentence
def pad_sent(sent, max_padding):
    sent_pos = predict_pos(sent)
    sent_le = le.transform(sent_pos)
    sent_le = [st for st in sent_le if st in list(le.classes_)] ####
    sent_pos_padded = np.pad(np.array(sent_le), (0, max_padding- len(sent_pos)%max_padding) , 'constant', constant_values=(PAD_VALUE))
    return sent_pos_padded

def pad_article(article, max_padding, max_sent):
    # input list from sent-token
    art_pos_pad = np.empty(shape=(max_sent, max_padding))
    art_pos_pad.fill(PAD_VALUE)
    for i,sent in enumerate(article):
        if i < max_sent:
            art_pos_pad[i] = pad_sent(sent,max_padding)
    return art_pos_pad

# count method in each sent
def counter_pos(article,max_sents):
    a =[]
    for idx,sent in enumerate(article):
        sent_pos = predict_pos(sent)
        count_pos = Counter(sent_pos)
        a.append(dict(count_pos))
    return a
        
def pos_count_article(counter_result, list_index,max_sents):
    article_pos_count_array = np.zeros(shape=(len(le.classes_),max_sents))
    for i,sent_pos_count in enumerate(counter_result):
        for j in sent_pos_count:
            article_pos_count_array[i,list_index.index(j)] = sent_pos_count.get(j)
    return article_pos_count_array

In [6]:
# Helper functions to extract argfeat prediction

def load_model(name):
    # name in form of numlabel_epochs
    for f in os.listdir('/data/ArgFeatModel/ModelWeights/'):
        if f.startswith('saved_weights_'+name):
            model_path = ('/data/ArgFeatModel/ModelWeights/'+f)
    loaded_model =  BertForSequenceClassification.from_pretrained('bert-base-cased',num_labels = int(name[0]))
    loaded_model.load_state_dict(torch.load(model_path))
    loaded_model.eval()
    loaded_model.to(device)
    return loaded_model

# sent preprocessing
def get_sent_argfeat(sent,tokenizer,model):
    # token IDs and attention mask for inference on the new sentence
    test_ids = []
    test_attention_mask = []
    # apply the tokenizer
    encoding = tokenizer.encode_plus(
                        sent,
                        add_special_tokens = True,
                        max_length = 256,
                        padding = True,
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )
    # extract IDs and attention mask
    test_ids.append(encoding['input_ids'])
    test_attention_mask.append(encoding['attention_mask'])
    test_ids = torch.cat(test_ids, dim = 0)
    test_attention_mask = torch.cat(test_attention_mask, dim = 0)
    with torch.no_grad():
        output = model(test_ids.to(device), token_type_ids = None, attention_mask = test_attention_mask.to(device))
    # get prediction
    pred = np.argmax(output.logits.cpu().numpy()).flatten().item()
    return pred

def get_argfeat(df,series_col,model, max_sent, df_idx):
    print ('Extracting from', list_of_files[df_idx])
    series = df[series_col]
    error_list = []
    count_long = 0
    pred_list = []
    error_list = []
    for idx in range(len(series)):
        article = series.iloc[idx]
        if len(article) > max_sent:
            try:
                pred_text = [get_sent_argfeat(sent,tokenizer,model)+1 for sent in article[:max_sent]]
                count_long += 1
            except:
                print('Error line:',idx)
                error_list.append(idx)
        else:
            try:
                pred_text = [get_sent_argfeat(sent,tokenizer,model)+1 for sent in article]# + [0] * (N-len(sent_token))
            except:
                print('Error line:',idx)
                error_list.append(idx)
        pred_list.append(pred_text)
    print('long articles:',count_long,'from',len(df))
    print('percent of long articles:', count_long/(count_long+len(df)))
    flat_list = [item for sublist in pred_list for item in sublist]
    num_label = max(flat_list)
    # new column
    df[str('argfeat'+str(num_label))] = pred_list
    df = df.drop(error_list)
#    df.to_csv(list_of_files[df_idx],sep='\t',header=False,index=False)
    print('Saved\t', list_of_files[df_idx].split('/')[-1])
    return df

# Main function

In [7]:
import warnings
warnings.filterwarnings('ignore')

In [9]:
MAXLEN= 100
PAD_VALUE = 80
MAX_SENT_PAD = 50
MAX_SENTS = MAXLEN
MAX_POS_PAD = 2000

#select group of files
list_of_files = select_files('/data/ProcessedNYT/','train_f') # processing training data (finance 1996;2005)
#list_of_files = select_files('/data/ProcessedNYT/','all') # processing all data on six topics in 1986;1996;2005
#list_of_files = select_files('/data/ProcessedNYT/','test') # processing six topics in 1986
#print(list_of_files)

# transform files into dataframes
list_of_dataframes = [transform_input(file) for file in list_of_files]

# get sentiment columns
print('loading sent...')
list_of_sent_dfs = [get_sentiment(df,2,df_idx) for df_idx,df in enumerate(list_of_dataframes)]

# get pos columns
print('loading pos...')
list_of_pos_dfs = [get_pos(df,2,df_idx) for df_idx,df in enumerate(list_of_dataframes)]

# specify GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load the BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# import arg-feat model
print('loading model...')
model3 = load_model("3_5")
model6 = load_model("6_3")

# get argfeat columns
print('loading argfeat...')
list_of_argfeat3_dfs = [get_argfeat(df,2, model3, MAX_SENTS, df_idx) for df_idx,df in enumerate(list_of_dataframes)]
list_of_argfeat6_dfs = [get_argfeat(df,2, model6, MAX_SENTS, df_idx) for df_idx,df in enumerate(list_of_dataframes)]

# final save
for df_idx, df in enumerate(list_of_argfeat6_dfs):
    df.to_csv(list_of_files[df_idx], sep='\t',header=False, index=False)

loading sent...
Saved	 train_finance.txt
loading pos...
Saved	 train_finance.txt
loading model...


Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

loading argfeat...
Extracting from /data/ProcessedNYT/train_finance.txt
long articles: 38 from 2783
percent of long articles: 0.01347040056717476
Saved	 train_finance.txt
Extracting from /data/ProcessedNYT/train_finance.txt
long articles: 38 from 2783
percent of long articles: 0.01347040056717476
Saved	 train_finance.txt
