# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Data Science 2: Advanced Topics in Data Science 

##  Final Project: Milestone 3 - Final Project [70 pts]


**Harvard University**<br/>
**Spring 2020**<br/>
**Group Members**: Fernando Medeiros, Mohammed Gufran Pathan, and Prerna Aggarwal<br/>

<hr style="height:2pt">

---

In [1]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML, display
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

<div class="theme"> Final Deliverables </div>

1. **Code Report:** You are expected to submit the code you developed as part of the course project. The commented code should be provided in report format. This means that each group in a Jupyter notebook should explain—in a clean and concise report fashion—how they proceeded at every step and coding/methodology  choices . The code report should have a structure that consists of an introduction, body and conclusion.
1. **Ignite Talk:** You will present the talk on 5/11, 5/12, or 5/13. Details to come for Ignite Talk guidelines.

[Final Project Guidelines](https://docs.google.com/document/d/1Zhmm9JP4FGQBi5abFiM22e5iXYo_rr7i_vbpW0-xt8A/edit)

## PreSumm

**Source**:

Code: https://github.com/nlpyang/PreSumm/

Paper: https://arxiv.org/abs/1908.08345

#### Dependencies

**Libraries**: 

Torch 1.1.0 (download instructions from https://pytorch.org/get-started/previous-versions/)

fastNLP (to install use ```pip install fastNLP```)

pyrouge (to install use ```pip install pyrouge```)

pytorch-transformers (use ```pip install pytorch-transformers``` to import BertTokenizer from others.tokenization)

rouge (to install use ```pip install rouge```)

transformers

```git clone https://github.com/huggingface/transformers
cd transformers
pip install .```

**Stanford CoreNLP**

We will need Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following command to your bash_profile:
```
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
```
replacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory. 

In [2]:
# Baisc Python Libraries
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import timeit

plt.style.use("ggplot")

In [26]:
# Project Python Lybraries
import argparse
import json
import logging
import lxml
import lzma
import multiprocessing as mp
import pickle
import queue
import re
import subprocess as sp
import sys
import tempfile
import torch

from argparse import Namespace
from bs4 import BeautifulSoup
from cytoolz import curry
from datetime import timedelta
from fastNLP.core.callback import SaveModelCallback
from fastNLP.core.tester import Tester
from fastNLP.core.trainer import Trainer
from itertools import combinations
from os.path import join, exists
from pandas.io.json import json_normalize
from pyrouge import Rouge155
from pyrouge.utils import log
from others.tokenization import BertTokenizer
from time import time
from torch import nn
from torch.nn import init
from torch.optim import Adam
from transformers import BertModel, RobertaModel
from transformers import BertTokenizer, RobertaTokenizer

#from model import MatchSum
#from callback import MyCallback
#from dataloader import MatchSumPipe
#from metrics import MarginRankingLoss, ValidMetric, MatchRougeMetric
#from utils import read_jsonl, get_data_path, get_result_path

In [4]:
r = Rouge155('./ROUGE-1.5.5')

2020-05-07 21:41:14,168 [MainThread  ] [INFO ]  Set ROUGE home directory to ./ROUGE-1.5.5.
INFO:global:Set ROUGE home directory to ./ROUGE-1.5.5.


## Read data

In [None]:
base_path = "./data/text"
state = 'north_carolina.xz'
f = lzma.open(os.path.join(base_path, state), "rb")
state_data = f.readlines()
f.close()
data_json = [json.loads(line) for line in state_data]
print(f'Flattening data for {state}')
data = json_normalize(data_json)

In [None]:
data['decision_date_p'] = pd.to_datetime(data.decision_date,errors='coerce')
data['decision_year'] = data.decision_date_p.dt.year

## Tokenize Data

In [None]:
def tokenize(raw_path,save_path):
    stories_dir = os.path.abspath(raw_path)
    tokenized_stories_dir = os.path.abspath(save_path)

    print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
    stories = os.listdir(stories_dir)
    # make IO list file
    print("Making list of files to tokenize...")
    with open("mapping_for_corenlp.txt", "w") as f:
        for s in stories:
            f.write("%s\n" % (os.path.join(stories_dir, s)))
    command = ['java', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit',
               '-ssplit.newlineIsSentenceBreak', 'always', '-filelist', 'mapping_for_corenlp.txt', '-outputFormat',
               'json', '-outputDirectory', tokenized_stories_dir]
    print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
    subprocess.call(command)
    print("Stanford CoreNLP Tokenizer has finished.")
    os.remove("mapping_for_corenlp.txt")

In [None]:
sample_data = data.iloc[:10]

In [None]:
for row in sample_data.iterrows():
    caseid = row[1].id
    markup = row[1]['casebody.data']
    soup = BeautifulSoup(markup, "xml")
    opinion = soup.find_all('opinion')[0]
    opinion_text = opinion.getText()
    headnotes = ' '.join([headnote.getText() for headnote in soup.find_all('headnotes')])
    
    with open(f'presumm_data/parsed_text/opinions/{caseid}.txt','w',encoding='utf-8') as f:
        f.write(opinion_text)
    
    with open(f'presumm_data/parsed_text/headnotes/{caseid}.txt','w',encoding='utf-8') as f:
        f.write(headnotes)

In [None]:
parsed_opinions_path = 'presumm_data/parsed_text/opinions'
tokenized_opinions_path = 'presumm_data/tokenized_text/opinions'
tokenize(parsed_opinions_path, tokenized_opinions_path)

In [None]:
parsed_headnotes_path = 'presumm_data/parsed_text/headnotes'
tokenized_headnotes_path = 'presumm_data/tokenized_text/headnotes'
tokenize(parsed_headnotes_path, tokenized_headnotes_path)

## Converting to JSON

In [None]:
REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}",
         "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'}


def clean(x):
    return re.sub(
        r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''",
        lambda m: REMAP.get(m.group()), x)

def load_json(case_id):
    source = []
    tgt = []
    source_path = os.path.join('presumm_data/tokenized_text/opinions',f'{case_id}.txt.json')
    target_path = os.path.join('presumm_data/tokenized_text/headnotes',f'{case_id}.txt.json')
    for sent in json.load(open(source_path,encoding='utf-8'))['sentences']:
        tokens = [t['word'] for t in sent['tokens']]
        tokens = [t.lower() for t in tokens]
        source.append(tokens)
    for sent in json.load(open(target_path,encoding='utf-8'))['sentences']:
        tokens = [t['word'] for t in sent['tokens']]
        tokens = [t.lower() for t in tokens]
        tgt.append(tokens)

    source = [clean(' '.join(sent)).split() for sent in source]
    tgt = [clean(' '.join(sent)).split() for sent in tgt]
    return source, tgt

### Greedy Selection

In [None]:
import re

def _get_ngrams(n, text):
    """Calcualtes n-grams.

    Args:
      n: which n-grams to calculate
      text: An array of tokens

    Returns:
      A set of n-grams
    """
    ngram_set = set()
    text_length = len(text)
    max_index_ngram_start = text_length - n
    for i in range(max_index_ngram_start + 1):
        ngram_set.add(tuple(text[i:i + n]))
    return ngram_set


def _get_word_ngrams(n, sentences):
    """Calculates word n-grams for multiple sentences.
    """
    assert len(sentences) > 0
    assert n > 0

    # words = _split_into_words(sentences)

    words = sum(sentences, [])
    # words = [w for w in words if w not in stopwords]
    return _get_ngrams(n, words)


def cal_rouge(evaluated_ngrams, reference_ngrams):
    reference_count = len(reference_ngrams)
    evaluated_count = len(evaluated_ngrams)

    overlapping_ngrams = evaluated_ngrams.intersection(reference_ngrams)
    overlapping_count = len(overlapping_ngrams)

    if evaluated_count == 0:
        precision = 0.0
    else:
        precision = overlapping_count / evaluated_count

    if reference_count == 0:
        recall = 0.0
    else:
        recall = overlapping_count / reference_count

    f1_score = 2.0 * ((precision * recall) / (precision + recall + 1e-8))
    return {"f": f1_score, "p": precision, "r": recall}


def greedy_selection(doc_sent_list, abstract_sent_list, summary_size):
    def _rouge_clean(s):
        return re.sub(r'[^a-zA-Z0-9 ]', '', s)
   
    max_rouge = 0.0
    abstract = sum(abstract_sent_list, [])
    #abstract = abstract_sent_list
    abstract = _rouge_clean(' '.join(abstract)).split()
    sents = [_rouge_clean(' '.join(s)).split() for s in doc_sent_list]
    evaluated_1grams = [_get_word_ngrams(1, [sent]) for sent in sents]
    #print(evaluated_1grams)
    reference_1grams = _get_word_ngrams(1, [abstract])
    evaluated_2grams = [_get_word_ngrams(2, [sent]) for sent in sents]
    reference_2grams = _get_word_ngrams(2, [abstract])

    selected = []

    for s in range(summary_size):
        cur_max_rouge = max_rouge
        cur_id = -1
        
        for i in range(len(sents)):
            if (i in selected):
                continue
                
            c = selected + [i]
            candidates_1 = [evaluated_1grams[idx] for idx in c]
            candidates_1 = set.union(*map(set, candidates_1))
            candidates_2 = [evaluated_2grams[idx] for idx in c]
            candidates_2 = set.union(*map(set, candidates_2))
            rouge_1 = cal_rouge(candidates_1, reference_1grams)['f']
            rouge_2 = cal_rouge(candidates_2, reference_2grams)['f']
            rouge_score = rouge_1 + rouge_2           
            if rouge_score > cur_max_rouge:
                cur_max_rouge = rouge_score
                cur_id = i
        if (cur_id == -1):
            return sorted(selected)
        selected.append(cur_id)
        max_rouge = cur_max_rouge
    
    
    return sorted(selected)

### Bert Data

In [None]:
max_src_nsents =10000
class BertData():
    def __init__(self, min_src_ntokens_per_sent=5,
                max_src_ntokens_per_sent=200,
                max_src_nsents=max_src_nsents,
                min_src_nsents=1,
                max_tgt_ntokens=500,
                min_tgt_ntokens=5):
        self.min_src_ntokens_per_sent = min_src_ntokens_per_sent
        self.max_src_ntokens_per_sent = max_src_ntokens_per_sent
        self.max_src_nsents = max_src_nsents
        self.min_src_nsents = min_src_nsents
        self.max_tgt_ntokens = max_tgt_ntokens
        self.min_tgt_ntokens = min_tgt_ntokens
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.tgt_bos = '[unused0]'
        self.tgt_eos = '[unused1]'
        self.tgt_sent_split = '[unused2]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]

    def preprocess(self, src, tgt, sent_labels, use_bert_basic_tokenizer=False, is_test=False):

        if ((not is_test) and len(src) == 0):
            return None

        original_src_txt = [' '.join(s) for s in src]

        idxs = [i for i, s in enumerate(src) if (len(s) > self.min_src_ntokens_per_sent)]

        _sent_labels = [0] * len(src)
        for l in sent_labels:
            _sent_labels[l] = 1

        src = [src[i][:self.max_src_ntokens_per_sent] for i in idxs]
        sent_labels = [_sent_labels[i] for i in idxs]
        src = src[:self.max_src_nsents]
        sent_labels = sent_labels[:self.max_src_nsents]

        if ((not is_test) and len(src) < self.min_src_nsents):
            return None

        src_txt = [' '.join(sent) for sent in src]
        text = ' {} {} '.format(self.sep_token, self.cls_token).join(src_txt)

        src_subtokens = self.tokenizer.tokenize(text)

        src_subtokens = [self.cls_token] + src_subtokens + [self.sep_token]
        src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
        segments_ids = []
        for i, s in enumerate(segs):
            if (i % 2 == 0):
                segments_ids += s * [0]
            else:
                segments_ids += s * [1]
        cls_ids = [i for i, t in enumerate(src_subtoken_idxs) if t == self.cls_vid]
        sent_labels = sent_labels[:len(cls_ids)]

        tgt_subtokens_str = '[unused0] ' + ' [unused2] '.join(
            [' '.join(self.tokenizer.tokenize(' '.join(tt), use_bert_basic_tokenizer=use_bert_basic_tokenizer)) for tt in tgt]) + ' [unused1]'
        tgt_subtoken = tgt_subtokens_str.split()[:self.max_tgt_ntokens]
        if ((not is_test) and len(tgt_subtoken) < self.min_tgt_ntokens):
            return None

        tgt_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(tgt_subtoken)

        tgt_txt = '<q>'.join([' '.join(tt) for tt in tgt])
        src_txt = [original_src_txt[i] for i in idxs]

        return src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt

In [None]:
datasets = []
case_ids = ['1268383','11272108','11272573','11272694','11273468',
            '11273534','11274033','11274050','11645357','11956941']
for case_id in case_ids:
    source, tgt = load_json(case_id)
    sent_labels = greedy_selection(source[:max_src_nsents], tgt, 5)
    source = [' '.join(s).lower().split() for s in source]
    tgt = [' '.join(s).lower().split() for s in tgt]
    bert = BertData()
    b_data = bert.preprocess(source, tgt, sent_labels, use_bert_basic_tokenizer=True,
                                     is_test=False)
    if b_data is not None:
        src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt = b_data
        b_data_dict = {"src": src_subtoken_idxs, "tgt": tgt_subtoken_idxs,
                               "src_sent_labels": sent_labels, "segs": segments_ids, 'clss': cls_ids,
                               'src_txt': src_txt, "tgt_txt": tgt_txt}
        datasets.append(b_data_dict)

In [None]:
len(datasets)

In [None]:
datasets[0]

In [None]:
torch.save(datasets, 'presumm_data/bert_sample.pt')

In [None]:
datasets = torch.load('presumm_data/bert_sample.pt')

In [None]:
bert_df = torch.load('data/bert_data_cnndm_final/cnndm.test.0.bert.pt')

In [None]:
bert_df[0]

In [12]:
%tb
MAX_LEN = 512

_ROUGE_PATH = './ROUGE-1.5.5'
temp_path = './temp' # path to store some temporary files

original_data, sent_ids = [], []

def load_jsonl(data_path):
    data = []
    with open(data_path) as f:
        for line in f:
            data.append(json.loads(line))
    return data

def get_rouge(path, dec):
    log.get_global_console_logger().setLevel(logging.WARNING)
    dec_pattern = '(\d+).dec'
    ref_pattern = '#ID#.ref'
    dec_dir = join(path, 'decode')
    ref_dir = join(path, 'reference')

    with open(join(dec_dir, '0.dec'), 'w') as f:
        for sentence in dec:
            print(sentence, file=f)

    cmd = '-c 95 -r 1000 -n 2 -m'
    with tempfile.TemporaryDirectory() as tmp_dir:
        Rouge155.convert_summaries_to_rouge_format(
            dec_dir, join(tmp_dir, 'dec'))
        Rouge155.convert_summaries_to_rouge_format(
            ref_dir, join(tmp_dir, 'ref'))
        Rouge155.write_config_static(
            join(tmp_dir, 'dec'), dec_pattern,
            join(tmp_dir, 'ref'), ref_pattern,
            join(tmp_dir, 'settings.xml'), system_id=1
        )
        cmd = (join(_ROUGE_PATH, 'ROUGE-1.5.5.pl')
            + ' -e {} '.format(join(_ROUGE_PATH, 'data'))
            + cmd
            + ' -a {}'.format(join(tmp_dir, 'settings.xml')))
        output = sp.check_output(cmd.split(' '), universal_newlines=True)

        line = output.split('\n')
        rouge1 = float(line[3].split(' ')[3])
        rouge2 = float(line[7].split(' ')[3])
        rougel = float(line[11].split(' ')[3])
    return (rouge1 + rouge2 + rougel) / 3

@curry
def get_candidates(tokenizer, cls, sep_id, idx):

    idx_path = join(temp_path, str(idx))
    
    # create some temporary files to calculate ROUGE
    sp.call('mkdir ' + idx_path, shell=True)
    sp.call('mkdir ' + join(idx_path, 'decode'), shell=True)
    sp.call('mkdir ' + join(idx_path, 'reference'), shell=True)
    
    # load data
    data = {}
    data['text'] = original_data[idx]['text']
    data['summary'] = original_data[idx]['summary']
    
    # write reference summary to temporary files
    ref_dir = join(idx_path, 'reference')
    with open(join(ref_dir, '0.ref'), 'w') as f:
        for sentence in data['summary']:
            print(sentence, file=f)

    # get candidate summaries
    # here is for CNN/DM: truncate each document into the 5 most important sentences (using BertExt), 
    # then select any 2 or 3 sentences to form a candidate summary, so there are C(5,2)+C(5,3)=20 candidate summaries.
    # if you want to process other datasets, you may need to adjust these numbers according to specific situation.
    sent_id = sent_ids[idx]['sent_id'][:5]
    indices = list(combinations(sent_id, 2))
    indices += list(combinations(sent_id, 3))
    if len(sent_id) < 2:
        indices = [sent_id]
    
    # get ROUGE score for each candidate summary and sort them in descending order
    score = []
    for i in indices:
        i = list(i)
        i.sort()
        # write dec
        dec = []
        for j in i:
            sent = data['text'][j]
            dec.append(sent)
        score.append((i, get_rouge(idx_path, dec)))
    score.sort(key=lambda x : x[1], reverse=True)
    
    # write candidate indices and score
    data['ext_idx'] = sent_id
    data['indices'] = []
    data['score'] = []
    for i, R in score:
        data['indices'].append(list(map(int, i)))
        data['score'].append(R)

    # tokenize and get candidate_id
    candidate_summary = []
    for i in data['indices']:
        cur_summary = [cls]
        for j in i:
            cur_summary += data['text'][j].split()
        cur_summary = cur_summary[:MAX_LEN]
        cur_summary = ' '.join(cur_summary)
        candidate_summary.append(cur_summary)
    
    data['candidate_id'] = []
    for summary in candidate_summary:
        token_ids = tokenizer.encode(summary, add_special_tokens=False)[:(MAX_LEN - 1)]
        token_ids += sep_id
        data['candidate_id'].append(token_ids)
    
    # tokenize and get text_id
    text = [cls]
    for sent in data['text']:
        text += sent.split()
    text = text[:MAX_LEN]
    text = ' '.join(text)
    token_ids = tokenizer.encode(text, add_special_tokens=False)[:(MAX_LEN - 1)]
    token_ids += sep_id
    data['text_id'] = token_ids
    
    # tokenize and get summary_id
    summary = [cls]
    for sent in data['summary']:
        summary += sent.split()
    summary = summary[:MAX_LEN]
    summary = ' '.join(summary)
    token_ids = tokenizer.encode(summary, add_special_tokens=False)[:(MAX_LEN - 1)]
    token_ids += sep_id
    data['summary_id'] = token_ids
    
    # write processed data to temporary file
    processed_path = join(temp_path, 'processed')
    with open(join(processed_path, '{}.json'.format(idx)), 'w') as f:
        json.dump(data, f, indent=4) 
    
    sp.call('rm -r ' + idx_path, shell=True)

def get_candidates_mp(args):
    
    # choose tokenizer
    if args.tokenizer == 'bert':
        tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        cls, sep = '[CLS]', '[SEP]'
    else:
        tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
        cls, sep = '<s>', '</s>'
    sep_id = tokenizer.encode(sep, add_special_tokens=False)

    # load original data and indices
    global original_data, sent_ids
    original_data = load_jsonl(args.data_path)
    original_data = original_data[0]
    sent_ids = load_jsonl(args.index_path)
    n_files = len(original_data)
    assert len(sent_ids) == len(original_data)
    print('total {} documents'.format(n_files))
    os.makedirs(temp_path)
    processed_path = join(temp_path, 'processed')
    os.makedirs(processed_path)

    # use multi-processing to get candidate summaries
    start = time()
    print('start getting candidates with multi-processing !!!')
    
    with mp.Pool() as pool:
        list(pool.imap_unordered(get_candidates(tokenizer, cls, sep_id), range(n_files), chunksize=64))
    
    print('finished in {}'.format(timedelta(seconds=time()-start)))
    
    # write processed data
    print('start writing {} files'.format(n_files))
    for i in range(n_files):
        with open(join(processed_path, '{}.json'.format(i))) as f:
            data = json.loads(f.read())
        with open(args.write_path, 'a') as f:
            print(json.dumps(data), file=f)
    
    os.system('rm -r {}'.format(temp_path))

if __name__ == '__main__':
    
    parser = argparse.ArgumentParser(
        description='Process truncated documents to obtain candidate summaries'
    )
    parser.add_argument('--tokenizer', type=str, required=True,
        help='BERT/RoBERTa')
    parser.add_argument('--data_path', type=str, required=True,
        help='path to the original dataset, the original dataset should contain text and summary')
    parser.add_argument('--index_path', type=str, required=True,
        help='indices of the remaining sentences of the truncated document')
    parser.add_argument('--write_path', type=str, required=True,
        help='path to store the processed dataset')

    args = parser.parse_args()
    assert args.tokenizer in ['bert', 'roberta']
    assert exists(args.data_path)
    assert exists(args.index_path)

    get_candidates_mp(args)

FileExistsError: [Errno 17] File exists: './temp'

usage: ipykernel_launcher.py [-h] --tokenizer TOKENIZER --data_path DATA_PATH
                             --index_path INDEX_PATH --write_path WRITE_PATH
ipykernel_launcher.py: error: the following arguments are required: --tokenizer, --data_path, --index_path, --write_path


SystemExit: 2

In [44]:
args_dict = {'tokenizer':'bert',
             'data_path':'./data/bert_data_jon/match_summ_sample.json',
             'index_path':'./data/bert_data_jon/sentence_id.json',
             'write_path':'./data/bert_data_jon/processed_data.jsonl'
            }
args = Namespace(**args_dict)

get_candidates_mp(args)

total 100 documents


FileExistsError: [Errno 17] File exists: './temp'

In [38]:
%%!
python ./preprocess/get_candidate.py --tokenizer=bert --data_path=/data/bert_data_jon/match_summ_sample.json --index_path=/data/bert_data_jon/sentence_id.json --write_path=/data/bert_data_jon/processed_data.jsonl


['Traceback (most recent call last):',
 '  File "./preprocess/get_candidate.py", line 15, in <module>',
 '    from pyrouge.utils import log',
 "ModuleNotFoundError: No module named 'pyrouge'"]

In [39]:
!ipython ./preprocess/get_candidate.py --tokenizer=bert --data_path=/data/bert_data_jon/match_summ_sample.json --index_path=/data/bert_data_jon/sentence_id.json --write_path=/data/bert_data_jon/processed_data.jsonl


]0;IPython: Harvard Extension School/case_law_g45[0;31m---------------------------------------------------------------------------[0m
[0;31mModuleNotFoundError[0m                       Traceback (most recent call last)
[0;32m~/Documents/My Education/Harvard Extension School/case_law_g45/preprocess/get_candidate.py[0m in [0;36m<module>[0;34m[0m
[1;32m     13[0m [0;34m[0m[0m
[1;32m     14[0m [0;32mfrom[0m [0mcytoolz[0m [0;32mimport[0m [0mcurry[0m[0;34m[0m[0;34m[0m[0m
[0;32m---> 15[0;31m [0;32mfrom[0m [0mpyrouge[0m[0;34m.[0m[0mutils[0m [0;32mimport[0m [0mlog[0m[0;34m[0m[0;34m[0m[0m
[0m[1;32m     16[0m [0;32mfrom[0m [0mpyrouge[0m [0;32mimport[0m [0mRouge155[0m[0;34m[0m[0;34m[0m[0m
[1;32m     17[0m [0;34m[0m[0m

[0;31mModuleNotFoundError[0m: No module named 'pyrouge'


In [41]:
%run -i './preprocess/get_candidate.py' --tokenizer=bert --data_path=/data/bert_data_jon/match_summ_sample.json --index_path=/data/bert_data_jon/sentence_id.json --write_path=/data/bert_data_jon/processed_data.jsonl


AssertionError: 

In [5]:
%tb
class MatchSum(nn.Module):
    
    def __init__(self, candidate_num, encoder, hidden_size=768):
        super(MatchSum, self).__init__()
        
        self.hidden_size = hidden_size
        self.candidate_num  = candidate_num
        
        if encoder == 'bert':
            self.encoder = BertModel.from_pretrained('bert-base-uncased')
        else:
            self.encoder = RobertaModel.from_pretrained('roberta-base')

    def forward(self, text_id, candidate_id, summary_id):
        
        batch_size = text_id.size(0)
        
        # get document embedding
        input_mask = ~(text_id == 0)
        out = self.encoder(text_id, attention_mask=input_mask)[0] # last layer
        doc_emb = out[:, 0, :]
        assert doc_emb.size() == (batch_size, self.hidden_size) # [batch_size, hidden_size]
        
        # get summary embedding
        input_mask = ~(summary_id == 0)
        out = self.encoder(summary_id, attention_mask=input_mask)[0] # last layer
        summary_emb = out[:, 0, :]
        assert summary_emb.size() == (batch_size, self.hidden_size) # [batch_size, hidden_size]

        # get summary score
        summary_score = torch.cosine_similarity(summary_emb, doc_emb, dim=-1)

        # get candidate embedding
        candidate_num = candidate_id.size(1)
        candidate_id = candidate_id.view(-1, candidate_id.size(-1))
        input_mask = ~(candidate_id == 0)
        out = self.encoder(candidate_id, attention_mask=input_mask)[0]
        candidate_emb = out[:, 0, :].view(batch_size, candidate_num, self.hidden_size)  # [batch_size, candidate_num, hidden_size]
        assert candidate_emb.size() == (batch_size, candidate_num, self.hidden_size)
        
        # get candidate score
        doc_emb = doc_emb.unsqueeze(1).expand_as(candidate_emb)
        score = torch.cosine_similarity(candidate_emb, doc_emb, dim=-1) # [batch_size, candidate_num]
        assert score.size() == (batch_size, candidate_num)

        return {'score': score, 'summary_score': summary_score}

No traceback available to show.


In [6]:
model = MatchSum(5,'bert')

In [7]:
model.eval

<bound method Module.eval of MatchSum(
  (encoder): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm(torch.Size([768]), eps=1e-12, elem

In [43]:
%%!
CUDA_VISIBLE_DEVICES=0 python ./preprocess/train_matching.py --mode=train --encoder=bert --save_path=./data/train --gpus=0,1,2,3,4,5,6,7


['Traceback (most recent call last):',
 '  File "./preprocess/train_matching.py", line 9, in <module>',
 '    from torch.optim import Adam',
 "ModuleNotFoundError: No module named 'torch'"]