# PreSumm

**Source**:

Code: https://github.com/nlpyang/PreSumm/


Paper: https://arxiv.org/abs/1908.08345

#### Pre-requisities

**Libraries**: 

Torch 1.1.0 (download instructions from https://pytorch.org/get-started/previous-versions/)

**Stanford CoreNLP**

We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:
```
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
```
replacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory. 


#### Code

In [1]:
from others.tokenization import BertTokenizer

In [2]:
import lzma
import os
import json
import re

from pandas.io.json import json_normalize
import pandas as pd
from bs4 import BeautifulSoup
import subprocess
import torch
import lxml
import numpy as np

# Read data

In [3]:
base_path = "./data/xml"
state='north_carolina.xz'
f = lzma.open(os.path.join(base_path,state),"rb")
state_data = f.readlines()
f.close()
data_json = [json.loads(line) for line in state_data]
print(f'Flattening data for {state}')
data = json_normalize(data_json)

Flattening data for north_carolina.xz


  


In [4]:
data['decision_date_p'] = pd.to_datetime(data.decision_date,errors='coerce')
data['decision_year'] = data.decision_date_p.dt.year

In [5]:
data_2008 = data[data.decision_year>=2008]

# Tokenize Data

In [6]:
def tokenize(raw_path,save_path):
    stories_dir = os.path.abspath(raw_path)
    tokenized_stories_dir = os.path.abspath(save_path)

    print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
    stories = os.listdir(stories_dir)
    # make IO list file
    print("Making list of files to tokenize...")
    with open("mapping_for_corenlp.txt", "w") as f:
        for s in stories:
            f.write("%s\n" % (os.path.join(stories_dir, s)))
    command = ['java', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit',
               '-ssplit.newlineIsSentenceBreak', 'always', '-filelist', 'mapping_for_corenlp.txt', '-outputFormat',
               'json', '-outputDirectory', tokenized_stories_dir]
    print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
    subprocess.call(command)
    print("Stanford CoreNLP Tokenizer has finished.")
    os.remove("mapping_for_corenlp.txt")


In [7]:
sample_data = data_2008.iloc[:100]

In [8]:
for row in data_2008.iterrows():
    try:
        caseid = row[1].id
        markup = row[1]['casebody.data']
        soup = BeautifulSoup(markup, "xml")
        opinion = soup.find_all('opinion')[0]
        opinion_text = opinion.getText()
        opinion_text = opinion_text.encode("ascii", "ignore").strip().decode("ascii")
        headnotes = (' '.join([headnotes.getText() for headnotes in soup.find_all('headnotes')])).replace('\n', ' ')
        headnotes = headnotes.encode("ascii", "ignore").strip().decode("ascii")

        if (len(headnotes) > 150 and len(opinion_text)>len(headnotes)):
            with open(f'presumm_data/parsed_text/opinions/{caseid}.txt','w') as f:
                f.write(opinion_text)

            with open(f'presumm_data/parsed_text/headnotes/{caseid}.txt','w') as f:
                f.write(headnotes)
    except:
        print(f'Case ID {caseid} parsing failed')

In [9]:
parsed_opinions_path = 'presumm_data/parsed_text/opinions'
tokenized_opinions_path = 'presumm_data/tokenized_text/opinions'
tokenize(parsed_opinions_path,tokenized_opinions_path)

Preparing to tokenize C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\parsed_text\opinions to C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\tokenized_text\opinions...
Making list of files to tokenize...
Tokenizing 3693 files in C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\parsed_text\opinions and saving in C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\tokenized_text\opinions...
Stanford CoreNLP Tokenizer has finished.


In [10]:
parsed_headnotes_path = 'presumm_data/parsed_text/headnotes'
tokenized_headnotes_path = 'presumm_data/tokenized_text/headnotes'
tokenize(parsed_headnotes_path,tokenized_headnotes_path)

Preparing to tokenize C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\parsed_text\headnotes to C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\tokenized_text\headnotes...
Making list of files to tokenize...
Tokenizing 3693 files in C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\parsed_text\headnotes and saving in C:\Users\gufra\OneDrive\Documents\Academics\AdvancedTopicsInDataScience\final_project\presumm_data\tokenized_text\headnotes...
Stanford CoreNLP Tokenizer has finished.


# Converting to JSON

In [3]:

REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}",
         "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'}


def clean(x):
    return re.sub(
        r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''",
        lambda m: REMAP.get(m.group()), x)

def load_json(case_id):
    source = []
    tgt = []
    source_path = os.path.join('presumm_data/tokenized_text/opinions',f'{case_id}.txt.json')
    target_path = os.path.join('presumm_data/tokenized_text/headnotes',f'{case_id}.txt.json')
    for sent in json.load(open(source_path,encoding='utf-8'))['sentences']:
        tokens = [t['word'].encode("ascii", "ignore").strip().decode("utf-8") for t in sent['tokens']]
        tokens = [t.lower() for t in tokens]
        source.append(tokens)
    for sent in json.load(open(target_path,encoding='utf-8'))['sentences']:
        tokens = [t['word'].encode("ascii", "ignore").strip().decode("utf-8") for t in sent['tokens']]
        tokens = [t.lower() for t in tokens]
        tgt.append(tokens)


    source = [clean(' '.join(sent)).split() for sent in source]
    tgt = [clean(' '.join(sent)).split() for sent in tgt]
    return source, tgt

### Greedy Selection

In [4]:
import re

def _get_ngrams(n, text):
    """Calcualtes n-grams.

    Args:
      n: which n-grams to calculate
      text: An array of tokens

    Returns:
      A set of n-grams
    """
    ngram_set = set()
    text_length = len(text)
    max_index_ngram_start = text_length - n
    for i in range(max_index_ngram_start + 1):
        ngram_set.add(tuple(text[i:i + n]))
    return ngram_set


def _get_word_ngrams(n, sentences):
    """Calculates word n-grams for multiple sentences.
    """
    assert len(sentences) > 0
    assert n > 0

    # words = _split_into_words(sentences)

    words = sum(sentences, [])
    # words = [w for w in words if w not in stopwords]
    return _get_ngrams(n, words)


def cal_rouge(evaluated_ngrams, reference_ngrams):
    reference_count = len(reference_ngrams)
    evaluated_count = len(evaluated_ngrams)

    overlapping_ngrams = evaluated_ngrams.intersection(reference_ngrams)
    overlapping_count = len(overlapping_ngrams)

    if evaluated_count == 0:
        precision = 0.0
    else:
        precision = overlapping_count / evaluated_count

    if reference_count == 0:
        recall = 0.0
    else:
        recall = overlapping_count / reference_count

    f1_score = 2.0 * ((precision * recall) / (precision + recall + 1e-8))
    return {"f": f1_score, "p": precision, "r": recall}


def greedy_selection(doc_sent_list, abstract_sent_list, summary_size):
    def _rouge_clean(s):
        return re.sub(r'[^a-zA-Z0-9 ]', '', s)
   
    max_rouge = 0.0
    abstract = sum(abstract_sent_list, [])
    #abstract = abstract_sent_list
    abstract = _rouge_clean(' '.join(abstract)).split()
    sents = [_rouge_clean(' '.join(s)).split() for s in doc_sent_list]
    evaluated_1grams = [_get_word_ngrams(1, [sent]) for sent in sents]
    #print(evaluated_1grams)
    reference_1grams = _get_word_ngrams(1, [abstract])
    evaluated_2grams = [_get_word_ngrams(2, [sent]) for sent in sents]
    reference_2grams = _get_word_ngrams(2, [abstract])

    selected = []

    for s in range(summary_size):
        cur_max_rouge = max_rouge
        cur_id = -1
        
        for i in range(len(sents)):
            if (i in selected):
                continue
                
            c = selected + [i]
            candidates_1 = [evaluated_1grams[idx] for idx in c]
            candidates_1 = set.union(*map(set, candidates_1))
            candidates_2 = [evaluated_2grams[idx] for idx in c]
            candidates_2 = set.union(*map(set, candidates_2))
            rouge_1 = cal_rouge(candidates_1, reference_1grams)['f']
            rouge_2 = cal_rouge(candidates_2, reference_2grams)['f']
            rouge_score = rouge_1 + rouge_2           
            if rouge_score > cur_max_rouge:
                cur_max_rouge = rouge_score
                cur_id = i
        if (cur_id == -1):
            return sorted(selected)
        selected.append(cur_id)
        max_rouge = cur_max_rouge
    
    
    return sorted(selected)

### Bert Data

In [5]:
max_src_nsents =10000
class BertData():
    def __init__(self, min_src_ntokens_per_sent=5,
                max_src_ntokens_per_sent=200,
                max_src_nsents=max_src_nsents,
                min_src_nsents=1,
                max_tgt_ntokens=500,
                min_tgt_ntokens=5):
        self.min_src_ntokens_per_sent = min_src_ntokens_per_sent
        self.max_src_ntokens_per_sent = max_src_ntokens_per_sent
        self.max_src_nsents = max_src_nsents
        self.min_src_nsents = min_src_nsents
        self.max_tgt_ntokens = max_tgt_ntokens
        self.min_tgt_ntokens = min_tgt_ntokens
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.tgt_bos = '[unused0]'
        self.tgt_eos = '[unused1]'
        self.tgt_sent_split = '[unused2]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]

    def preprocess(self, src, tgt, sent_labels, use_bert_basic_tokenizer=False, is_test=False):

        if ((not is_test) and len(src) == 0):
            return None

        original_src_txt = [' '.join(s) for s in src]

        idxs = [i for i, s in enumerate(src) if (len(s) > self.min_src_ntokens_per_sent)]

        _sent_labels = [0] * len(src)
        for l in sent_labels:
            _sent_labels[l] = 1

        src = [src[i][:self.max_src_ntokens_per_sent] for i in idxs]
        sent_labels = [_sent_labels[i] for i in idxs]
        src = src[:self.max_src_nsents]
        sent_labels = sent_labels[:self.max_src_nsents]

        if ((not is_test) and len(src) < self.min_src_nsents):
            return None

        src_txt = [' '.join(sent) for sent in src]
        text = ' {} {} '.format(self.sep_token, self.cls_token).join(src_txt)

        src_subtokens = self.tokenizer.tokenize(text)

        src_subtokens = [self.cls_token] + src_subtokens + [self.sep_token]
        src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
        segments_ids = []
        for i, s in enumerate(segs):
            if (i % 2 == 0):
                segments_ids += s * [0]
            else:
                segments_ids += s * [1]
        cls_ids = [i for i, t in enumerate(src_subtoken_idxs) if t == self.cls_vid]
        sent_labels = sent_labels[:len(cls_ids)]

        tgt_subtokens_str = '[unused0] ' + ' [unused2] '.join(
            [' '.join(self.tokenizer.tokenize(' '.join(tt), use_bert_basic_tokenizer=use_bert_basic_tokenizer)) for tt in tgt]) + ' [unused1]'
        tgt_subtoken = tgt_subtokens_str.split()[:self.max_tgt_ntokens]
        if ((not is_test) and len(tgt_subtoken) < self.min_tgt_ntokens):
            return None

        tgt_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(tgt_subtoken)

        tgt_txt = '<q>'.join([' '.join(tt) for tt in tgt])
        src_txt = [original_src_txt[i] for i in idxs]

        return src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt


In [14]:
case_files = os.listdir('./presumm_data/tokenized_text/opinions')
case_ids = [case_file.replace(".txt.json","") for case_file in case_files]
parsed_files = [case_id.replace(".json","") for case_id in os.listdir('./presumm_data/json_data')]
#case_ids = list(set(case_ids).difference(parsed_files))
len(case_ids)

3693

In [17]:
def generate_bert_data(case_id):
    source, tgt = load_json(case_id)
    sent_labels = greedy_selection(source[:max_src_nsents], tgt, 5)
    source = [' '.join(s).lower().split() for s in source]
    tgt = [' '.join(s).lower().split() for s in tgt]
    bert = BertData()
    b_data = bert.preprocess(source, tgt, sent_labels, use_bert_basic_tokenizer=True,
                                     is_test=False)
    if b_data is not None:
        src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt = b_data
        b_data_dict = {"src": src_subtoken_idxs, "tgt": tgt_subtoken_idxs,
                               "src_sent_labels": sent_labels, "segs": segments_ids, 'clss': cls_ids,
                               'src_txt': src_txt, "tgt_txt": tgt_txt}
        return (case_id,b_data_dict)


In [18]:
from mutliprocessing_funcs import generate_bert_data

In [19]:
from multiprocessing import Pool
pool = Pool(32)
for b_data_tp in pool.imap_unordered(generate_bert_data,case_ids):
    if b_data_tp is not None:
        with open(f'./presumm_data/json_data/{b_data_tp[0]}.json', 'w') as fp:
            json.dump(b_data_tp[1], fp)
pool.close()
pool.join()

### Create Train, test and validation datasets

In [20]:
all_cases = [case_id.replace(".json","") for case_id in os.listdir('./presumm_data/json_data/')]

In [21]:
num_cases = len(all_cases)
train_cases = int(np.ceil(num_cases*0.8))
val_cases = int(np.ceil((num_cases-train_cases)/2))
test_cases = num_cases-val_cases-train_cases
all_index = np.arange(num_cases)
np.random.seed(1)
np.random.shuffle(all_index)
train_indices =all_index[:train_cases]
val_indices = all_index[train_cases:train_cases+val_cases]
test_indices = all_index[train_cases+val_cases:] 

In [22]:
train_cases = np.array(all_cases)[train_indices]
val_cases = np.array(all_cases)[val_indices]
test_cases = np.array(all_cases)[test_indices]

In [23]:
def append_samples(case_list):
    appended_samples = []
    for case_id in case_list:
        try:
            with open(f'./presumm_data/json_data/{case_id}.json','r') as f:
                case_content = f.read()
                case_content = json.loads(case_content)
            appended_samples.append(case_content)
        except:
            print(f'Error reading case {case_id}')
    return appended_samples

In [24]:
train_dataset = append_samples(train_cases)
val_dataset = append_samples(val_cases)
test_dataset = append_samples(test_cases)

In [25]:
torch.save(train_dataset, 'presumm_data/train_dataset.pt')
torch.save(val_dataset, 'presumm_data/val_dataset.pt')
torch.save(test_dataset, 'presumm_data/test_dataset.pt')

In [26]:
test_sample_dataset = append_samples(test_cases[:10])
len(test_sample_dataset)

10

In [27]:
torch.save(test_sample_dataset, 'presumm/bert_data/sample/cnndm.test.1.bert.pt')

In [28]:
with open('test_cases.json','w') as f:
    json.dump(test_cases.tolist(),f)

with open('train_cases.json','w') as f:
    json.dump(train_cases.tolist(),f)

with open('val_cases.json','w') as f:
    json.dump(val_cases.tolist(),f)


### Data Prep for Matchsum data

In [7]:
# For Match Sum Data
with open('test_cases.json','r') as f:
    test_cases = json.load(f)

text_summary=[]
sent_id = []

for case_id in test_cases:
    
    source, tgt = load_json(case_id)
    sent_labels = greedy_selection(source[:max_src_nsents], tgt, 5)
    source = [' '.join(s).lower() for s in source]
    tgt = [' '.join(s).lower() for s in tgt]
    #text_summary.append({'text':source, 'summary':tgt})
    #sent_id.append({'sent_id':sent_labels})

    with open('sentence_id.json','a+') as f:
        json.dump({'sent_id':sent_labels},f)
        f.write('\n')
    
    with open('match_summ_sample.json','a+') as f:
        json.dump({'text':source, 'summary':tgt},f)
        f.write('\n')

In [None]:
# PreSumm

**Source**:

Code: https://github.com/nlpyang/PreSumm/


Paper: https://arxiv.org/abs/1908.08345

#### Pre-requisities

**Libraries**: 

Torch 1.1.0 (download instructions from https://pytorch.org/get-started/previous-versions/)

**Stanford CoreNLP**

We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:
```
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
```
replacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory. 


#### Code

from others.tokenization import BertTokenizer

import lzma
import os
import json
import re

from pandas.io.json import json_normalize
import pandas as pd
from bs4 import BeautifulSoup
import subprocess
import torch
import lxml
import numpy as np

# Read data

base_path = "./data/xml"
state='north_carolina.xz'
f = lzma.open(os.path.join(base_path,state),"rb")
state_data = f.readlines()
f.close()
data_json = [json.loads(line) for line in state_data]
print(f'Flattening data for {state}')
data = json_normalize(data_json)

data['decision_date_p'] = pd.to_datetime(data.decision_date,errors='coerce')
data['decision_year'] = data.decision_date_p.dt.year

data_2008 = data[data.decision_year>=2008]

# Tokenize Data

def tokenize(raw_path,save_path):
    stories_dir = os.path.abspath(raw_path)
    tokenized_stories_dir = os.path.abspath(save_path)

    print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
    stories = os.listdir(stories_dir)
    # make IO list file
    print("Making list of files to tokenize...")
    with open("mapping_for_corenlp.txt", "w") as f:
        for s in stories:
            f.write("%s\n" % (os.path.join(stories_dir, s)))
    command = ['java', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit',
               '-ssplit.newlineIsSentenceBreak', 'always', '-filelist', 'mapping_for_corenlp.txt', '-outputFormat',
               'json', '-outputDirectory', tokenized_stories_dir]
    print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
    subprocess.call(command)
    print("Stanford CoreNLP Tokenizer has finished.")
    os.remove("mapping_for_corenlp.txt")


sample_data = data_2008.iloc[:100]

for row in data_2008.iterrows():
    try:
        caseid = row[1].id
        markup = row[1]['casebody.data']
        soup = BeautifulSoup(markup, "xml")
        opinion = soup.find_all('opinion')[0]
        opinion_text = opinion.getText()
        opinion_text = opinion_text.encode("ascii", "ignore").strip().decode("ascii")
        headnotes = (' '.join([headnotes.getText() for headnotes in soup.find_all('headnotes')])).replace('\n', ' ')
        headnotes = headnotes.encode("ascii", "ignore").strip().decode("ascii")

        if (len(headnotes) > 150 and len(opinion_text)>len(headnotes)):
            with open(f'presumm_data/parsed_text/opinions/{caseid}.txt','w') as f:
                f.write(opinion_text)

            with open(f'presumm_data/parsed_text/headnotes/{caseid}.txt','w') as f:
                f.write(headnotes)
    except:
        print(f'Case ID {caseid} parsing failed')

parsed_opinions_path = 'presumm_data/parsed_text/opinions'
tokenized_opinions_path = 'presumm_data/tokenized_text/opinions'
tokenize(parsed_opinions_path,tokenized_opinions_path)

parsed_headnotes_path = 'presumm_data/parsed_text/headnotes'
tokenized_headnotes_path = 'presumm_data/tokenized_text/headnotes'
tokenize(parsed_headnotes_path,tokenized_headnotes_path)

# Converting to JSON


REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}",
         "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'}


def clean(x):
    return re.sub(
        r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''",
        lambda m: REMAP.get(m.group()), x)

def load_json(case_id):
    source = []
    tgt = []
    source_path = os.path.join('presumm_data/tokenized_text/opinions',f'{case_id}.txt.json')
    target_path = os.path.join('presumm_data/tokenized_text/headnotes',f'{case_id}.txt.json')
    for sent in json.load(open(source_path,encoding='utf-8'))['sentences']:
        tokens = [t['word'].encode("ascii", "ignore").strip().decode("utf-8") for t in sent['tokens']]
        tokens = [t.lower() for t in tokens]
        source.append(tokens)
    for sent in json.load(open(target_path,encoding='utf-8'))['sentences']:
        tokens = [t['word'].encode("ascii", "ignore").strip().decode("utf-8") for t in sent['tokens']]
        tokens = [t.lower() for t in tokens]
        tgt.append(tokens)


    source = [clean(' '.join(sent)).split() for sent in source]
    tgt = [clean(' '.join(sent)).split() for sent in tgt]
    return source, tgt

### Greedy Selection

import re

def _get_ngrams(n, text):
    """Calcualtes n-grams.

    Args:
      n: which n-grams to calculate
      text: An array of tokens

    Returns:
      A set of n-grams
    """
    ngram_set = set()
    text_length = len(text)
    max_index_ngram_start = text_length - n
    for i in range(max_index_ngram_start + 1):
        ngram_set.add(tuple(text[i:i + n]))
    return ngram_set


def _get_word_ngrams(n, sentences):
    """Calculates word n-grams for multiple sentences.
    """
    assert len(sentences) > 0
    assert n > 0

    # words = _split_into_words(sentences)

    words = sum(sentences, [])
    # words = [w for w in words if w not in stopwords]
    return _get_ngrams(n, words)


def cal_rouge(evaluated_ngrams, reference_ngrams):
    reference_count = len(reference_ngrams)
    evaluated_count = len(evaluated_ngrams)

    overlapping_ngrams = evaluated_ngrams.intersection(reference_ngrams)
    overlapping_count = len(overlapping_ngrams)

    if evaluated_count == 0:
        precision = 0.0
    else:
        precision = overlapping_count / evaluated_count

    if reference_count == 0:
        recall = 0.0
    else:
        recall = overlapping_count / reference_count

    f1_score = 2.0 * ((precision * recall) / (precision + recall + 1e-8))
    return {"f": f1_score, "p": precision, "r": recall}


def greedy_selection(doc_sent_list, abstract_sent_list, summary_size):
    def _rouge_clean(s):
        return re.sub(r'[^a-zA-Z0-9 ]', '', s)
   
    max_rouge = 0.0
    abstract = sum(abstract_sent_list, [])
    #abstract = abstract_sent_list
    abstract = _rouge_clean(' '.join(abstract)).split()
    sents = [_rouge_clean(' '.join(s)).split() for s in doc_sent_list]
    evaluated_1grams = [_get_word_ngrams(1, [sent]) for sent in sents]
    #print(evaluated_1grams)
    reference_1grams = _get_word_ngrams(1, [abstract])
    evaluated_2grams = [_get_word_ngrams(2, [sent]) for sent in sents]
    reference_2grams = _get_word_ngrams(2, [abstract])

    selected = []

    for s in range(summary_size):
        cur_max_rouge = max_rouge
        cur_id = -1
        
        for i in range(len(sents)):
            if (i in selected):
                continue
                
            c = selected + [i]
            candidates_1 = [evaluated_1grams[idx] for idx in c]
            candidates_1 = set.union(*map(set, candidates_1))
            candidates_2 = [evaluated_2grams[idx] for idx in c]
            candidates_2 = set.union(*map(set, candidates_2))
            rouge_1 = cal_rouge(candidates_1, reference_1grams)['f']
            rouge_2 = cal_rouge(candidates_2, reference_2grams)['f']
            rouge_score = rouge_1 + rouge_2           
            if rouge_score > cur_max_rouge:
                cur_max_rouge = rouge_score
                cur_id = i
        if (cur_id == -1):
            return sorted(selected)
        selected.append(cur_id)
        max_rouge = cur_max_rouge
    
    
    return sorted(selected)

### Bert Data

max_src_nsents =10000
class BertData():
    def __init__(self, min_src_ntokens_per_sent=5,
                max_src_ntokens_per_sent=200,
                max_src_nsents=max_src_nsents,
                min_src_nsents=1,
                max_tgt_ntokens=500,
                min_tgt_ntokens=5):
        self.min_src_ntokens_per_sent = min_src_ntokens_per_sent
        self.max_src_ntokens_per_sent = max_src_ntokens_per_sent
        self.max_src_nsents = max_src_nsents
        self.min_src_nsents = min_src_nsents
        self.max_tgt_ntokens = max_tgt_ntokens
        self.min_tgt_ntokens = min_tgt_ntokens
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.tgt_bos = '[unused0]'
        self.tgt_eos = '[unused1]'
        self.tgt_sent_split = '[unused2]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]

    def preprocess(self, src, tgt, sent_labels, use_bert_basic_tokenizer=False, is_test=False):

        if ((not is_test) and len(src) == 0):
            return None

        original_src_txt = [' '.join(s) for s in src]

        idxs = [i for i, s in enumerate(src) if (len(s) > self.min_src_ntokens_per_sent)]

        _sent_labels = [0] * len(src)
        for l in sent_labels:
            _sent_labels[l] = 1

        src = [src[i][:self.max_src_ntokens_per_sent] for i in idxs]
        sent_labels = [_sent_labels[i] for i in idxs]
        src = src[:self.max_src_nsents]
        sent_labels = sent_labels[:self.max_src_nsents]

        if ((not is_test) and len(src) < self.min_src_nsents):
            return None

        src_txt = [' '.join(sent) for sent in src]
        text = ' {} {} '.format(self.sep_token, self.cls_token).join(src_txt)

        src_subtokens = self.tokenizer.tokenize(text)

        src_subtokens = [self.cls_token] + src_subtokens + [self.sep_token]
        src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
        segments_ids = []
        for i, s in enumerate(segs):
            if (i % 2 == 0):
                segments_ids += s * [0]
            else:
                segments_ids += s * [1]
        cls_ids = [i for i, t in enumerate(src_subtoken_idxs) if t == self.cls_vid]
        sent_labels = sent_labels[:len(cls_ids)]

        tgt_subtokens_str = '[unused0] ' + ' [unused2] '.join(
            [' '.join(self.tokenizer.tokenize(' '.join(tt), use_bert_basic_tokenizer=use_bert_basic_tokenizer)) for tt in tgt]) + ' [unused1]'
        tgt_subtoken = tgt_subtokens_str.split()[:self.max_tgt_ntokens]
        if ((not is_test) and len(tgt_subtoken) < self.min_tgt_ntokens):
            return None

        tgt_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(tgt_subtoken)

        tgt_txt = '<q>'.join([' '.join(tt) for tt in tgt])
        src_txt = [original_src_txt[i] for i in idxs]

        return src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt


case_files = os.listdir('./presumm_data/tokenized_text/opinions')
case_ids = [case_file.replace(".txt.json","") for case_file in case_files]
parsed_files = [case_id.replace(".json","") for case_id in os.listdir('./presumm_data/json_data')]
#case_ids = list(set(case_ids).difference(parsed_files))
len(case_ids)

def generate_bert_data(case_id):
    source, tgt = load_json(case_id)
    sent_labels = greedy_selection(source[:max_src_nsents], tgt, 5)
    source = [' '.join(s).lower().split() for s in source]
    tgt = [' '.join(s).lower().split() for s in tgt]
    bert = BertData()
    b_data = bert.preprocess(source, tgt, sent_labels, use_bert_basic_tokenizer=True,
                                     is_test=False)
    if b_data is not None:
        src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt = b_data
        b_data_dict = {"src": src_subtoken_idxs, "tgt": tgt_subtoken_idxs,
                               "src_sent_labels": sent_labels, "segs": segments_ids, 'clss': cls_ids,
                               'src_txt': src_txt, "tgt_txt": tgt_txt}
        return (case_id,b_data_dict)


from mutliprocessing_funcs import generate_bert_data

from multiprocessing import Pool
pool = Pool(32)
for b_data_tp in pool.imap_unordered(generate_bert_data,case_ids):
    if b_data_tp is not None:
        with open(f'./presumm_data/json_data/{b_data_tp[0]}.json', 'w') as fp:
            json.dump(b_data_tp[1], fp)
pool.close()
pool.join()

### Create Train, test and validation datasets

all_cases = [case_id.replace(".json","") for case_id in os.listdir('./presumm_data/json_data/')]

num_cases = len(all_cases)
train_cases = int(np.ceil(num_cases*0.8))
val_cases = int(np.ceil((num_cases-train_cases)/2))
test_cases = num_cases-val_cases-train_cases
all_index = np.arange(num_cases)
np.random.seed(1)
np.random.shuffle(all_index)
train_indices =all_index[:train_cases]
val_indices = all_index[train_cases:train_cases+val_cases]
test_indices = all_index[train_cases+val_cases:] 

train_cases = np.array(all_cases)[train_indices]
val_cases = np.array(all_cases)[val_indices]
test_cases = np.array(all_cases)[test_indices]

def append_samples(case_list):
    appended_samples = []
    for case_id in case_list:
        try:
            with open(f'./presumm_data/json_data/{case_id}.json','r') as f:
                case_content = f.read()
                case_content = json.loads(case_content)
            appended_samples.append(case_content)
        except:
            print(f'Error reading case {case_id}')
    return appended_samples

train_dataset = append_samples(train_cases)
val_dataset = append_samples(val_cases)
test_dataset = append_samples(test_cases)

torch.save(train_dataset, 'presumm_data/train_dataset.pt')
torch.save(val_dataset, 'presumm_data/val_dataset.pt')
torch.save(test_dataset, 'presumm_data/test_dataset.pt')

test_sample_dataset = append_samples(test_cases[:10])
len(test_sample_dataset)

torch.save(test_sample_dataset, 'presumm/bert_data/sample/cnndm.test.1.bert.pt')

with open('test_cases.json','w') as f:
    json.dump(test_cases.tolist(),f)

with open('train_cases.json','w') as f:
    json.dump(train_cases.tolist(),f)

with open('val_cases.json','w') as f:
    json.dump(val_cases.tolist(),f)


### Data Prep for Matchsum data

# For Match Sum Data
with open('test_cases.json','r') as f:
    test_cases = json.load(f)

text_summary=[]
sent_id = []

for case_id in test_cases:
    
    source, tgt = load_json(case_id)
    sent_labels = greedy_selection(source[:max_src_nsents], tgt, 5)
    source = [' '.join(s).lower() for s in source]
    tgt = [' '.join(s).lower() for s in tgt]
    #text_summary.append({'text':source, 'summary':tgt})
    #sent_id.append({'sent_id':sent_labels})

    with open('sentence_id.json','a+') as f:
        json.dump({'sent_id':sent_labels},f)
        f.write('\n')
    
    with open('match_summ_sample.json','a+') as f:
        json.dump({'text':source, 'summary':tgt},f)
        f.write('\n')

This notebook tests the baseline performance on models pretrained on CNN/DailyMail data

# Model Training and Evaluation

### Parameter Declaration

In [1]:
from argparse import Namespace
from train_abstractive import test_abs,train_abs_single
from train_extractive import test_ext,train_single_ext


In [2]:
arg_params = {'accum_count':1,
'alpha':0.6,
'batch_size':300,
'beam_size':5,
'beta1':0.9,
'beta2':0.999,
'block_trigram':True,
'dec_dropout':0.2,
'dec_ff_size':2048,
'dec_heads':8,
'dec_hidden_size':768,
'dec_layers':6,
'enc_dropout':0.2,
'enc_ff_size':512,
'enc_hidden_size':512,
'enc_layers':6,
'encoder':'bert',
'ext_dropout':0.2,
'ext_ff_size':2048,
'ext_heads':8,
'ext_hidden_size':768,
'ext_layers':2,
'finetune_bert':True,
'generator_shard_size':32,
'gpu_ranks':[0],
'label_smoothing':0.1,
'large':False,
'load_from_extractive':'',
'lr':1,
'lr_bert':0.002,
'lr_dec':0.002,
'max_grad_norm':0,
'max_length':150,
'max_pos':512,
'max_tgt_len':140,
'min_length':15,
'optim':'adam',
'param_init':0,
'param_init_glorot':True,
'recall_eval':False,
'report_every':1,
'report_rouge':True,
'save_checkpoint_steps':15,
'seed':666,
'sep_optim':False,
'share_emb':False,
'temp_dir':'./temp',
'test_all':False,
'test_batch_size':200,
'test_start_from':-1,
'train_from':'',
'train_steps':1000,
'use_bert_emb':False,
'use_interval':True,
'visible_gpus':'-1',
'warmup_steps':8000,
'warmup_steps_bert':8000,
'warmup_steps_dec':8000,
'world_size':1}

# Baseline Evaluation (Pre-tained model)

## Extractive Summarization

In [3]:
ext_args_dict = arg_params
ext_args_dict.update({
            'bert_data_path':'./data',
    'log_file':'./logs/ext_baseline',
    'model_path':'./model_files/pre_trained/ext',
    'result_path':'./results/pre_trained/ext_baseline',
    'test_from':'./model_files/pre_trained/ext/bertext_cnndm_transformer.pt',
    'task':'ext',
    'mode':'test',

    'batch_size':300,
    'ext_dropout':0.1
})

args = Namespace(**ext_args_dict)

In [4]:
test_ext(args, device_id=-1, pt=args.test_from, step=0)

100%|██████████| 433/433 [00:00<00:00, 349659.92B/s]

Namespace(accum_count=1, alpha=0.6, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.2, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/ext_baseline', lr=1, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=150, max_pos=512, max_tgt_len=140, min_length=15, mode='test', model_path='./model_files/pre_trained/ext', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='./results/pre_trained/ext_baseline', save_checkpoint_steps=15, seed=666, sep_optim=False, share_emb=False, task='ext', temp_dir='./temp', test_all=False, test_batch_size=200, 


100%|██████████| 440473133/440473133 [00:09<00:00, 45928011.17B/s]


pts ['./data/test.pt']
gpu_rank 0


2020-05-09 17:26:44,770 [MainThread  ] [INFO ]  Writing summaries.
2020-05-09 17:26:44,772 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./temp/tmptfgzu6h4/system and model files to ./temp/tmptfgzu6h4/model.
2020-05-09 17:26:44,772 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-17-26-44/candidate/.
2020-05-09 17:26:44,815 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmptfgzu6h4/system.
2020-05-09 17:26:44,816 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-17-26-44/reference/.
2020-05-09 17:26:44,864 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmptfgzu6h4/model.
2020-05-09 17:26:44,869 [MainThread  ] [INFO ]  Written ROUGE configuration to ./temp/tmpitnfnofj/rouge_conf.xml
2020-05-09 17:26:44,869 [MainThread  ] [INFO ]  Running ROUGE with command perl ./ROUGE-1.5.5/ROUGE-1.5.5.pl -e ./ROUGE-1.5.5/data -c 95 -m -r 1000 -n 2 -a ./temp/tmpitnfnofj/rouge_conf.xml


369
369
---------------------------------------------
1 ROUGE-1 Average_R: 0.24577 (95%-conf.int. 0.23134 - 0.26120)
1 ROUGE-1 Average_P: 0.49301 (95%-conf.int. 0.47854 - 0.50676)
1 ROUGE-1 Average_F: 0.29475 (95%-conf.int. 0.28252 - 0.30729)
---------------------------------------------
1 ROUGE-2 Average_R: 0.09804 (95%-conf.int. 0.08843 - 0.10867)
1 ROUGE-2 Average_P: 0.18863 (95%-conf.int. 0.17645 - 0.20073)
1 ROUGE-2 Average_F: 0.11580 (95%-conf.int. 0.10651 - 0.12531)
---------------------------------------------
1 ROUGE-L Average_R: 0.20960 (95%-conf.int. 0.19742 - 0.22252)
1 ROUGE-L Average_P: 0.42895 (95%-conf.int. 0.41545 - 0.44275)
1 ROUGE-L Average_F: 0.25311 (95%-conf.int. 0.24279 - 0.26383)



## Abstractive Summarization (Bert)

In [5]:
abs_args_dict = arg_params

abs_args_dict.update({
            'bert_data_path':'./data',
    'log_file':'./logs/abs_bertextabs',
    'model_path':'./model_files/pre_trained/abs_bertextabs/',
    'result_path':'./results/pre_trained/abs_bertextabs',
    'test_from':'./model_files/pre_trained/abs_bertextabs/model_step_148000.pt',
    'task':'abs',
    'mode':'test',
    
    'batch_size':300,
    'test_batch_size':200,
    'max_pos':512,
    'max_length':200,
    'alpha': 0.95,
    'min_length':50,
        
    'sep_optim':True,
    'user_interval':True

})


args = Namespace(**abs_args_dict)


In [6]:
test_abs(args, device_id=-1, pt=args.test_from, step=0)

Namespace(accum_count=1, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.2, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/abs_bertextabs', lr=1, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='test', model_path='./model_files/pre_trained/abs_bertextabs/', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='./results/pre_trained/abs_bertextabs', save_checkpoint_steps=15, seed=666, sep_optim=True, share_emb=False, task='abs', temp_dir='./temp', test_all=False, test_

100%|██████████| 231508/231508 [00:00<00:00, 6099608.21B/s]
2020-05-09 18:29:11,656 [MainThread  ] [INFO ]  Writing summaries.
2020-05-09 18:29:11,658 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./temp/tmpltn13h9f/system and model files to ./temp/tmpltn13h9f/model.
2020-05-09 18:29:11,658 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-18-29-11/candidate/.
2020-05-09 18:29:11,704 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmpltn13h9f/system.
2020-05-09 18:29:11,705 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-18-29-11/reference/.
2020-05-09 18:29:11,757 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmpltn13h9f/model.
2020-05-09 18:29:11,762 [MainThread  ] [INFO ]  Written ROUGE configuration to ./temp/tmp1t6jzo0o/rouge_conf.xml
2020-05-09 18:29:11,763 [MainThread  ] [INFO ]  Running ROUGE with command perl ./ROUGE-1.5.5/ROUGE-1.5.5.pl -e ./ROUGE-1.5.5/data -c 95 -m -r 1000 -n 2 -a ./temp/

369
369
---------------------------------------------
1 ROUGE-1 Average_R: 0.19541 (95%-conf.int. 0.18399 - 0.20711)
1 ROUGE-1 Average_P: 0.45892 (95%-conf.int. 0.44256 - 0.47679)
1 ROUGE-1 Average_F: 0.24623 (95%-conf.int. 0.23525 - 0.25664)
---------------------------------------------
1 ROUGE-2 Average_R: 0.06171 (95%-conf.int. 0.05450 - 0.06911)
1 ROUGE-2 Average_P: 0.14881 (95%-conf.int. 0.13608 - 0.16175)
1 ROUGE-2 Average_F: 0.07836 (95%-conf.int. 0.07054 - 0.08633)
---------------------------------------------
1 ROUGE-L Average_R: 0.16693 (95%-conf.int. 0.15691 - 0.17709)
1 ROUGE-L Average_P: 0.40175 (95%-conf.int. 0.38643 - 0.41851)
1 ROUGE-L Average_F: 0.21218 (95%-conf.int. 0.20277 - 0.22168)



## Abstractive Summarization (Transformers)

In [7]:
abs_args_dict = arg_params

abs_args_dict.update({
            'bert_data_path':'./data',
    'log_file':'./logs/abs_transformers',
    'model_path':'./model_files/pre_trained/abs_transformer/',
    'result_path':'./results/pre_trained/abs_transformer',
    'test_from':'./model_files/pre_trained/abs_transformer/cnndm_baseline_best.pt',
    'task':'abs',
    'mode':'test',
    
    'batch_size':300,
    'test_batch_size':200,
    'max_pos':512,
    'max_length':200,
    'min_length':50,
        
    'sep_optim':False,

})


args = Namespace(**abs_args_dict)
test_abs(args, device_id=-1, pt=args.test_from, step=0)

Namespace(accum_count=1, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=512, dec_layers=6, enc_dropout=0.2, enc_ff_size=2048, enc_hidden_size=512, enc_layers=6, encoder='baseline', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/abs_transformers', lr=1, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='test', model_path='./model_files/pre_trained/abs_transformer/', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='./results/pre_trained/abs_transformer', save_checkpoint_steps=15, seed=666, sep_optim=False, share_emb=False, task='abs', temp_dir='./temp', test_all=Fa

2020-05-09 19:21:23,479 [MainThread  ] [INFO ]  Writing summaries.
2020-05-09 19:21:23,480 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./temp/tmp1j8r5osy/system and model files to ./temp/tmp1j8r5osy/model.
2020-05-09 19:21:23,481 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-19-21-23/candidate/.
2020-05-09 19:21:23,527 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmp1j8r5osy/system.
2020-05-09 19:21:23,528 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-19-21-23/reference/.
2020-05-09 19:21:23,582 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmp1j8r5osy/model.
2020-05-09 19:21:23,587 [MainThread  ] [INFO ]  Written ROUGE configuration to ./temp/tmpgfavjhyn/rouge_conf.xml
2020-05-09 19:21:23,588 [MainThread  ] [INFO ]  Running ROUGE with command perl ./ROUGE-1.5.5/ROUGE-1.5.5.pl -e ./ROUGE-1.5.5/data -c 95 -m -r 1000 -n 2 -a ./temp/tmpgfavjhyn/rouge_conf.xml


369
369
---------------------------------------------
1 ROUGE-1 Average_R: 0.20233 (95%-conf.int. 0.19042 - 0.21442)
1 ROUGE-1 Average_P: 0.44334 (95%-conf.int. 0.42921 - 0.45924)
1 ROUGE-1 Average_F: 0.24971 (95%-conf.int. 0.23907 - 0.26005)
---------------------------------------------
1 ROUGE-2 Average_R: 0.06280 (95%-conf.int. 0.05640 - 0.07004)
1 ROUGE-2 Average_P: 0.13759 (95%-conf.int. 0.12712 - 0.14869)
1 ROUGE-2 Average_F: 0.07703 (95%-conf.int. 0.07082 - 0.08388)
---------------------------------------------
1 ROUGE-L Average_R: 0.17118 (95%-conf.int. 0.16133 - 0.18091)
1 ROUGE-L Average_P: 0.38531 (95%-conf.int. 0.37209 - 0.40006)
1 ROUGE-L Average_F: 0.21326 (95%-conf.int. 0.20432 - 0.22181)



# Training & Evaluation on Case Law Data

## Extractive Summarization

In [8]:
training_steps = 30
ext_args_train_dict = arg_params

ext_args_train_dict.update({
    
            'bert_data_path':'./data',
    'log_file':'./logs/train_ext',
    'mode':'train',
    'model_path':'./model_files/trained/ext',
    'result_path':'../results/trained/ext',
    'train_from':'./model_files/pre_trained/ext/bertext_cnndm_transformer.pt',

    'report_every':1,
    'save_checkpoint_steps':15,
    'batch_size':300,
    'train_steps':18001 + training_steps,
    'warmup_steps':1,
    'max_pos':512,

    'task':'ext',
    'ext_dropout':0.1,
    'lr':0.002,
    'accum_count':2,
    'use_interval':True
})


args = Namespace(**ext_args_train_dict)

In [9]:
print(args)

Namespace(accum_count=2, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.2, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/train_ext', lr=0.002, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='train', model_path='./model_files/trained/ext', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='../results/trained/ext', save_checkpoint_steps=15, seed=666, sep_optim=False, share_emb=False, task='ext', temp_dir='./temp', test_all=False, test_batch_size=200, test_from='./

In [10]:
train_single_ext(args, device_id=-1)

[2020-05-09 19:21:37,445 INFO] Device ID -1
[2020-05-09 19:21:37,446 INFO] Device cpu
[2020-05-09 19:21:37,451 INFO] Loading checkpoint from ./model_files/pre_trained/ext/bertext_cnndm_transformer.pt
[2020-05-09 19:21:37,976 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at ./temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-09 19:21:37,978 INFO] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states":

dict_keys(['model', 'opt', 'optim'])
True


[2020-05-09 19:21:40,728 INFO] * number of parameters: 120512513
[2020-05-09 19:21:40,728 INFO] Start training...


gpu_rank 0
pts ['./data/train.pt']


[2020-05-09 19:21:42,474 INFO] Loading train dataset from ./data/train.pt, number of examples: 2955


Step=18001, trains_steps=18031


[2020-05-09 19:21:47,806 INFO] Step 18001/18031; xent: 1.79; lr: 0.0000149;   1 docs/s;      5 sec
[2020-05-09 19:21:52,916 INFO] Step 18002/18031; xent: 1.00; lr: 0.0000149;   1 docs/s;     10 sec
[2020-05-09 19:21:57,968 INFO] Step 18003/18031; xent: 2.12; lr: 0.0000149;   1 docs/s;     15 sec
[2020-05-09 19:22:03,144 INFO] Step 18004/18031; xent: 1.07; lr: 0.0000149;   1 docs/s;     21 sec
[2020-05-09 19:22:08,255 INFO] Step 18005/18031; xent: 1.03; lr: 0.0000149;   1 docs/s;     26 sec
[2020-05-09 19:22:13,359 INFO] Step 18006/18031; xent: 0.40; lr: 0.0000149;   1 docs/s;     31 sec
[2020-05-09 19:22:18,460 INFO] Step 18007/18031; xent: 0.30; lr: 0.0000149;   1 docs/s;     36 sec
[2020-05-09 19:22:23,953 INFO] Step 18008/18031; xent: 0.91; lr: 0.0000149;   1 docs/s;     41 sec
[2020-05-09 19:22:29,269 INFO] Step 18009/18031; xent: 0.24; lr: 0.0000149;   1 docs/s;     47 sec
[2020-05-09 19:22:34,098 INFO] Step 18010/18031; xent: 0.17; lr: 0.0000149;   1 docs/s;     52 sec
[2020-05-0

pts ['./data/train.pt']


[2020-05-09 19:24:25,923 INFO] Loading train dataset from ./data/train.pt, number of examples: 2955


### Evaluate the trained model

In [11]:
ext_args_dict = arg_params
ext_args_dict.update({
            'bert_data_path':'./data',
    'log_file':'./logs/ext_trained',
    'model_path':'./model_files/trained/ext',
    'result_path':'./results/trained/ext_trained',
    'test_from':f'./model_files/trained/ext/model_step_{str(18001+training_steps-1)}.pt',
    'task':'ext',
    'mode':'test',

    'batch_size':300,
    'ext_dropout':0.1,
    'accum_count':2
})

args = Namespace(**ext_args_dict)
test_ext(args, device_id=-1, pt=args.test_from, step=0)

[2020-05-09 19:24:26,168 INFO] Loading checkpoint from ./model_files/trained/ext/model_step_18030.pt
[2020-05-09 19:24:26,485 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at ./temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-09 19:24:26,486 INFO] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,


Namespace(accum_count=2, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.2, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/ext_trained', lr=0.002, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='test', model_path='./model_files/trained/ext', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='./results/trained/ext_trained', save_checkpoint_steps=15, seed=666, sep_optim=False, share_emb=False, task='ext', temp_dir='./temp', test_all=False, test_batch_size=200, test_

[2020-05-09 19:24:28,865 INFO] Loading test dataset from ./data/test.pt, number of examples: 369
[2020-05-09 19:24:28,872 INFO] * number of parameters: 120512513


pts ['./data/test.pt']
gpu_rank 0


2020-05-09 19:27:07,182 [MainThread  ] [INFO ]  Writing summaries.
[2020-05-09 19:27:07,182 INFO] Writing summaries.
2020-05-09 19:27:07,184 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./temp/tmp1osroyne/system and model files to ./temp/tmp1osroyne/model.
[2020-05-09 19:27:07,184 INFO] Processing summaries. Saving system files to ./temp/tmp1osroyne/system and model files to ./temp/tmp1osroyne/model.
2020-05-09 19:27:07,185 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-19-27-07/candidate/.
[2020-05-09 19:27:07,185 INFO] Processing files in ./temp/rouge-tmp-2020-05-09-19-27-07/candidate/.
2020-05-09 19:27:07,229 [MainThread  ] [INFO ]  Saved processed files to ./temp/tmp1osroyne/system.
[2020-05-09 19:27:07,229 INFO] Saved processed files to ./temp/tmp1osroyne/system.
2020-05-09 19:27:07,230 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-19-27-07/reference/.
[2020-05-09 19:27:07,230 INFO] Processing files in ./te

369
369


[2020-05-09 19:27:21,876 INFO] Rouges at step 0 
>> ROUGE-F(1/2/3/l): 30.20/12.20/26.04
ROUGE-R(1/2/3/l): 25.13/10.28/21.49

[2020-05-09 19:27:21,877 INFO] Validation xent: 2.43483 at step 0


---------------------------------------------
1 ROUGE-1 Average_R: 0.25128 (95%-conf.int. 0.23658 - 0.26767)
1 ROUGE-1 Average_P: 0.50811 (95%-conf.int. 0.49359 - 0.52156)
1 ROUGE-1 Average_F: 0.30204 (95%-conf.int. 0.29001 - 0.31453)
---------------------------------------------
1 ROUGE-2 Average_R: 0.10276 (95%-conf.int. 0.09280 - 0.11371)
1 ROUGE-2 Average_P: 0.20040 (95%-conf.int. 0.18753 - 0.21366)
1 ROUGE-2 Average_F: 0.12198 (95%-conf.int. 0.11282 - 0.13200)
---------------------------------------------
1 ROUGE-L Average_R: 0.21493 (95%-conf.int. 0.20285 - 0.22879)
1 ROUGE-L Average_P: 0.44402 (95%-conf.int. 0.42986 - 0.45792)
1 ROUGE-L Average_F: 0.26038 (95%-conf.int. 0.25049 - 0.27132)



## Abstractive Summarization (Transformer Basline)

In [12]:
training_steps = 30
abs_args_train_dict = arg_params

abs_args_train_dict.update({
    
            'bert_data_path':'./data',
    'log_file':'./logs/train_abs_transformer',
    'mode':'train',
    'model_path':'./model_files/trained/abs_transformer',
    'result_path':'./results/trained/abs_transformer',
    'train_from':'./model_files/pre_trained/abs_transformer/cnndm_baseline_best.pt',
    'task': abs,

    
    'accum_count':5,
    'batch_size':300,
    'dec_dropout':0.1,
    'lr':0.05,
    'save_checkpoint_steps':15,
    'sep_optim':False,
    'train_steps':38001 + training_steps,
    'use_bert_emb':True,
    'warmup_steps':1,
    'report_every':50,
    'enc_hidden_size':512 ,
'enc_layers':6,
'enc_ff_size': 2048,
'enc_dropout': 0.1,
'dec_layers': 6,
'dec_hidden_size': 512,
'dec_ff_size':2048,
'encoder': 'baseline'

})


args = Namespace(**abs_args_train_dict)

In [13]:
train_abs_single(args, device_id=-1)

[2020-05-09 19:27:21,908 INFO] Namespace(accum_count=5, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.1, dec_ff_size=2048, dec_heads=8, dec_hidden_size=512, dec_layers=6, enc_dropout=0.1, enc_ff_size=2048, enc_hidden_size=512, enc_layers=6, encoder='baseline', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/train_abs_transformer', lr=0.05, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='train', model_path='./model_files/trained/abs_transformer', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=50, report_rouge=True, result_path='./results/trained/abs_transformer', save_checkpoint_steps=15, seed=666, sep_optim=False, share_emb=False, task=<buil

dict_keys(['model', 'opt', 'optims'])
False


[2020-05-09 19:27:26,314 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./temp/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[2020-05-09 19:27:26,349 INFO] * number of parameters: 75951418
[2020-05-09 19:27:26,350 INFO] Start training...


gpu_rank 0
pts ['./data/train.pt']


[2020-05-09 19:27:27,544 INFO] Loading train dataset from ./data/train.pt, number of examples: 2955


Step=38001, Train_steps=38031


[2020-05-09 19:28:37,923 INFO] Saving checkpoint ./model_files/trained/abs_transformer/model_step_38010.pt
[2020-05-09 19:30:26,395 INFO] Saving checkpoint ./model_files/trained/abs_transformer/model_step_38025.pt


pts ['./data/train.pt']


[2020-05-09 19:31:11,705 INFO] Loading train dataset from ./data/train.pt, number of examples: 2955


### Evaluate the trained model

In [14]:
abs_args_dict = arg_params

abs_args_dict.update({
            'bert_data_path':'./data',
    'log_file':'./logs/abs_transformers_trained',
    'model_path':'./model_files/trained/abs_transformer/',
    'result_path':'./results/trained/abs_transformer',
    'test_from':f'./model_files/trained/abs_transformer/model_step_38025.pt',
    'task':'abs',
    'mode':'test',
    
    'batch_size':300,
    'test_batch_size':200,
    'max_pos':512,
    'max_length':200,
    'min_length':50,
        
    'sep_optim':False,

})


args = Namespace(**abs_args_dict)
test_abs(args, device_id=-1, pt=args.test_from, step=0)

[2020-05-09 19:31:11,923 INFO] Loading checkpoint from ./model_files/trained/abs_transformer/model_step_38025.pt
[2020-05-09 19:31:12,200 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at ./temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-09 19:31:12,201 INFO] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_voca

Namespace(accum_count=5, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.1, dec_ff_size=2048, dec_heads=8, dec_hidden_size=512, dec_layers=6, enc_dropout=0.1, enc_ff_size=2048, enc_hidden_size=512, enc_layers=6, encoder='baseline', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='', log_file='./logs/abs_transformers_trained', lr=0.05, lr_bert=0.002, lr_dec=0.002, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='test', model_path='./model_files/trained/abs_transformer/', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=50, report_rouge=True, result_path='./results/trained/abs_transformer', save_checkpoint_steps=15, seed=666, sep_optim=False, share_emb=False, task='abs', temp_dir='./temp', test_al

[2020-05-09 19:31:16,037 INFO] Loading test dataset from ./data/test.pt, number of examples: 369


pts ['./data/test.pt']


[2020-05-09 19:31:16,119 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./temp/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[2020-05-09 20:50:40,526 INFO] Calculating Rouge
2020-05-09 20:50:40,572 [MainThread  ] [INFO ]  Writing summaries.
[2020-05-09 20:50:40,572 INFO] Writing summaries.
2020-05-09 20:50:40,574 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./temp/tmpn0wozr0p/system and model files to ./temp/tmpn0wozr0p/model.
[2020-05-09 20:50:40,574 INFO] Processing summaries. Saving system files to ./temp/tmpn0wozr0p/system and model files to ./temp/tmpn0wozr0p/model.
2020-05-09 20:50:40,577 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-20-50-40/candidate/.
[2020-05-09 20:50:40,577 INFO] Processing files in ./temp/rouge-tmp-2020-05-09-20-50-40/candidate/.
2020-05-09 20:50:40,625 [MainTh

369
369


[2020-05-09 20:50:53,340 INFO] Rouges at step 0 
>> ROUGE-F(1/2/3/l): 27.62/11.06/23.48
ROUGE-R(1/2/3/l): 21.51/8.69/18.08



---------------------------------------------
1 ROUGE-1 Average_R: 0.21512 (95%-conf.int. 0.20138 - 0.22990)
1 ROUGE-1 Average_P: 0.54829 (95%-conf.int. 0.53100 - 0.56577)
1 ROUGE-1 Average_F: 0.27616 (95%-conf.int. 0.26313 - 0.28898)
---------------------------------------------
1 ROUGE-2 Average_R: 0.08687 (95%-conf.int. 0.07851 - 0.09528)
1 ROUGE-2 Average_P: 0.21848 (95%-conf.int. 0.20456 - 0.23297)
1 ROUGE-2 Average_F: 0.11064 (95%-conf.int. 0.10257 - 0.11962)
---------------------------------------------
1 ROUGE-L Average_R: 0.18084 (95%-conf.int. 0.17001 - 0.19243)
1 ROUGE-L Average_P: 0.47653 (95%-conf.int. 0.45968 - 0.49282)
1 ROUGE-L Average_F: 0.23477 (95%-conf.int. 0.22417 - 0.24531)



## Abstractive Summarization (BertExt)

In [15]:
training_steps = 30
abs_args_train_dict = arg_params

abs_args_train_dict.update({
    
            'bert_data_path':'./data',
    'log_file':'./logs/train_abs',
    'mode':'train',
    'model_path':'./model_files/trained/abs',
    'result_path':'./results/trained/abs',
    'train_from':'./model_files/pre_trained/abs_bertextabs/model_step_148000.pt',
        'load_from_extractive':f'./model_files/trained/ext/model_step_{str(18001 + training_steps -1)}.pt',
    'task': abs,
    'save_checkpoint_steps':15,
    'batch_size':300,
    'train_steps':148001+training_steps,
    'report_every':1,
    'accum_count':5,
    'warmup_steps_bert':1,
    'warmup_steps_dec':1,
  
'dec_dropout':0.2,
'sep_optim':True,
'lr_bert':0.002,
'lr_dec':0.2,
'use_bert_emb':True,
'use_interval':True,
'max_pos':512
})


args = Namespace(**abs_args_train_dict)

In [16]:
train_abs_single(args, device_id=-1)

[2020-05-09 20:50:53,369 INFO] Namespace(accum_count=5, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=512, dec_layers=6, enc_dropout=0.1, enc_ff_size=2048, enc_hidden_size=512, enc_layers=6, encoder='baseline', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='./model_files/trained/ext/model_step_18030.pt', log_file='./logs/train_abs', lr=0.05, lr_bert=0.002, lr_dec=0.2, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='train', model_path='./model_files/trained/abs', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='./results/trained/abs', save_checkpoint_steps=15, seed=666, sep_optim=True, share_emb=False, task=

gpu_rank 0
pts ['./data/train.pt']


[2020-05-09 20:51:00,531 INFO] Loading train dataset from ./data/train.pt, number of examples: 2955


Step=148001, Train_steps=148031


[2020-05-09 20:51:17,324 INFO] Step 148001/148031; acc:  26.90; ppl: 87.66; xent: 4.47; lr: 0.00000520;   0/ 37 tok/s;     17 sec
[2020-05-09 20:51:32,878 INFO] Step 148002/148031; acc:  34.17; ppl: 61.84; xent: 4.12; lr: 0.00000520;   0/ 34 tok/s;     32 sec
[2020-05-09 20:51:49,484 INFO] Step 148003/148031; acc:  23.77; ppl: 105.11; xent: 4.65; lr: 0.00000520;   0/ 40 tok/s;     49 sec
[2020-05-09 20:52:06,043 INFO] Step 148004/148031; acc:  17.99; ppl: 188.99; xent: 5.24; lr: 0.00000520;   0/ 42 tok/s;     66 sec
[2020-05-09 20:52:22,765 INFO] Step 148005/148031; acc:  29.32; ppl: 83.67; xent: 4.43; lr: 0.00000520;   0/ 36 tok/s;     82 sec
[2020-05-09 20:52:22,771 INFO] Saving checkpoint ./model_files/trained/abs/model_step_148005.pt
[2020-05-09 20:52:39,505 INFO] Step 148006/148031; acc:  31.46; ppl: 65.89; xent: 4.19; lr: 0.00000520;   0/ 34 tok/s;     99 sec
[2020-05-09 20:52:55,507 INFO] Step 148007/148031; acc:  29.44; ppl: 69.42; xent: 4.24; lr: 0.00000520;   0/ 38 tok/s;    

pts ['./data/train.pt']


[2020-05-09 20:59:31,727 INFO] Loading train dataset from ./data/train.pt, number of examples: 2955


### Evaluate the trained model

In [19]:
abs_args_dict = arg_params

abs_args_dict.update({
            'bert_data_path':'./data',
    'log_file':'./logs/abs_bertextabs_trained',
    'model_path':'./model_files/trained/abs_bertextabs/',
    'result_path':'./results/trained/abs_bertextabs',
    'test_from':f'./model_files/trained/abs/model_step_148020.pt',
    'task':'abs',
    'mode':'test',
    'batch_size':300,
    'test_batch_size':200,
    'max_pos':512,
    'max_length':200,
    'alpha': 0.95,
    'min_length':50,
        
    'sep_optim':True,
    'user_interval':True

})


args = Namespace(**abs_args_dict)
test_abs(args, device_id=-1, pt=args.test_from, step=0)

[2020-05-09 21:03:06,251 INFO] Loading checkpoint from ./model_files/trained/abs/model_step_148020.pt
[2020-05-09 21:03:06,758 INFO] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json from cache at ./temp/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517
[2020-05-09 21:03:06,760 INFO] Model config {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "pad_token_id": 0,
  "pruned_heads": {},
  "torchscript": false,
  "type_vocab_size": 2,

Namespace(accum_count=5, alpha=0.95, batch_size=300, beam_size=5, bert_data_path='./data', beta1=0.9, beta2=0.999, block_trigram=True, dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=6, enc_dropout=0.1, enc_ff_size=512, enc_hidden_size=512, enc_layers=6, encoder='bert', ext_dropout=0.1, ext_ff_size=2048, ext_heads=8, ext_hidden_size=768, ext_layers=2, finetune_bert=True, generator_shard_size=32, gpu_ranks=[0], label_smoothing=0.1, large=False, load_from_extractive='./model_files/trained/ext/model_step_18030.pt', log_file='./logs/abs_bertextabs_trained', lr=0.05, lr_bert=0.002, lr_dec=0.2, max_grad_norm=0, max_length=200, max_pos=512, max_tgt_len=140, min_length=50, mode='test', model_path='./model_files/trained/abs_bertextabs/', optim='adam', param_init=0, param_init_glorot=True, recall_eval=False, report_every=1, report_rouge=True, result_path='./results/trained/abs_bertextabs', save_checkpoint_steps=15, seed=666, sep_optim=True, share_emb=False, task='

[2020-05-09 21:03:10,379 INFO] Loading test dataset from ./data/test.pt, number of examples: 369


pts ['./data/test.pt']


[2020-05-09 21:03:10,457 INFO] loading file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt from cache at ./temp/26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084
[2020-05-09 22:23:27,746 INFO] Calculating Rouge
2020-05-09 22:23:27,793 [MainThread  ] [INFO ]  Writing summaries.
[2020-05-09 22:23:27,793 INFO] Writing summaries.
2020-05-09 22:23:27,795 [MainThread  ] [INFO ]  Processing summaries. Saving system files to ./temp/tmph8zoxwkw/system and model files to ./temp/tmph8zoxwkw/model.
[2020-05-09 22:23:27,795 INFO] Processing summaries. Saving system files to ./temp/tmph8zoxwkw/system and model files to ./temp/tmph8zoxwkw/model.
2020-05-09 22:23:27,797 [MainThread  ] [INFO ]  Processing files in ./temp/rouge-tmp-2020-05-09-22-23-27/candidate/.
[2020-05-09 22:23:27,797 INFO] Processing files in ./temp/rouge-tmp-2020-05-09-22-23-27/candidate/.
2020-05-09 22:23:27,845 [MainTh

369
369


[2020-05-09 22:23:42,309 INFO] Rouges at step 0 
>> ROUGE-F(1/2/3/l): 29.28/10.66/24.97
ROUGE-R(1/2/3/l): 24.29/8.91/20.57



---------------------------------------------
1 ROUGE-1 Average_R: 0.24294 (95%-conf.int. 0.22840 - 0.25752)
1 ROUGE-1 Average_P: 0.50696 (95%-conf.int. 0.49036 - 0.52454)
1 ROUGE-1 Average_F: 0.29275 (95%-conf.int. 0.28149 - 0.30372)
---------------------------------------------
1 ROUGE-2 Average_R: 0.08911 (95%-conf.int. 0.08084 - 0.09830)
1 ROUGE-2 Average_P: 0.18710 (95%-conf.int. 0.17395 - 0.20229)
1 ROUGE-2 Average_F: 0.10662 (95%-conf.int. 0.09880 - 0.11491)
---------------------------------------------
1 ROUGE-L Average_R: 0.20565 (95%-conf.int. 0.19380 - 0.21763)
1 ROUGE-L Average_P: 0.43912 (95%-conf.int. 0.42397 - 0.45664)
1 ROUGE-L Average_F: 0.24970 (95%-conf.int. 0.24002 - 0.25936)



# Matchsum

In [1]:
from preprocess.get_candidate import get_candidates_mp
from argparse import Namespace
import json
import shutil
import os
from preprocess.train_matching import test_model

In [7]:
args_dict = {'data_path':'matchsumm_data/match_summ_sample.json',
            'index_path':'matchsumm_data/sentence_id.json',
            'write_path':'data/test_CNNDM_bert.jsonl',
            'tokenizer':'bert'}
args = Namespace(**args_dict)

In [8]:
%%capture
shutil.rmtree('./temp') if os.path.isdir('./temp') else None
get_candidates_mp(args)

### Bert

In [2]:
args_dict = {'mode':'test',
'encoder':'bert',
'save_path':'matchsumm_models/',
            'candidate_num':20,
            'gpus':0,
            'encoder':'bert'}
args = Namespace(**args_dict)

In [3]:
shutil.rmtree('./temp') if os.path.isdir('./temp') else None
shutil.rmtree('data/result/') if os.path.isdir('data/result/') else None
test_model(args)

Start loading datasets !!!
Finished in 0:00:00.152812
Information of dataset is:
In total 1 datasets:
	test has 369 instances.

Current model is MatchSum_cnndm_bert.ckpt




369/369 (100.00%) decoded in 0:11:54 seconds
Start writing files !!!
Start evaluating ROUGE score !!!
---------------------------------------------
1 ROUGE-1 Average_R: 0.33545 (95%-conf.int. 0.31925 - 0.35333)
1 ROUGE-1 Average_P: 0.79428 (95%-conf.int. 0.78238 - 0.80488)
1 ROUGE-1 Average_F: 0.44129 (95%-conf.int. 0.42646 - 0.45706)
---------------------------------------------
1 ROUGE-2 Average_R: 0.24185 (95%-conf.int. 0.22784 - 0.25637)
1 ROUGE-2 Average_P: 0.58127 (95%-conf.int. 0.56360 - 0.59731)
1 ROUGE-2 Average_F: 0.31916 (95%-conf.int. 0.30535 - 0.33325)
---------------------------------------------
1 ROUGE-L Average_R: 0.31068 (95%-conf.int. 0.29551 - 0.32740)
1 ROUGE-L Average_P: 0.74182 (95%-conf.int. 0.72769 - 0.75541)
1 ROUGE-L Average_F: 0.40957 (95%-conf.int. 0.39584 - 0.42462)

Evaluate data in 730.31 seconds!
[tester] 
MatchRougeMetric: ROUGE-1=0.44129, ROUGE-2=0.31916, ROUGE-L=0.40957
Current model is MatchSum_cnndm_roberta.ckpt




369/369 (100.00%) decoded in 0:11:48 seconds
Start writing files !!!
Start evaluating ROUGE score !!!
---------------------------------------------
1 ROUGE-1 Average_R: 0.43630 (95%-conf.int. 0.41940 - 0.45346)
1 ROUGE-1 Average_P: 0.77631 (95%-conf.int. 0.76389 - 0.78864)
1 ROUGE-1 Average_F: 0.52943 (95%-conf.int. 0.51599 - 0.54246)
---------------------------------------------
1 ROUGE-2 Average_R: 0.32228 (95%-conf.int. 0.30765 - 0.33644)
1 ROUGE-2 Average_P: 0.57995 (95%-conf.int. 0.56391 - 0.59588)
1 ROUGE-2 Average_F: 0.39246 (95%-conf.int. 0.37956 - 0.40653)
---------------------------------------------
1 ROUGE-L Average_R: 0.40682 (95%-conf.int. 0.39012 - 0.42230)
1 ROUGE-L Average_P: 0.72883 (95%-conf.int. 0.71508 - 0.74245)
1 ROUGE-L Average_F: 0.49484 (95%-conf.int. 0.48198 - 0.50785)

Evaluate data in 727.33 seconds!
[tester] 
MatchRougeMetric: ROUGE-1=0.52943, ROUGE-2=0.39246, ROUGE-L=0.49484
