# <img style="float: left; padding-right: 10px; width: 45px" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png"> CS109B Data Science 2: Advanced Topics in Data Science 

##  Final Project: Milestone 3 - Final Project [70 pts]


**Harvard University**<br/>
**Spring 2020**<br/>
**Group Members**: Fernando Medeiros, Mohammed Gufran Pathan, and Prerna Aggarwal<br/>

<hr style="height:2pt">

---

In [1]:
#RUN THIS CELL 
import requests
from IPython.core.display import HTML, display
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)

<div class="theme"> Final Deliverables </div>

1. **Code Report:** You are expected to submit the code you developed as part of the course project. The commented code should be provided in report format. This means that each group should explain in a Jupyter notebook, in a clean and concise report fashion, how they proceeded at every step and which coding and methodology choices they made. The code report should have a structure that consists of an introduction, body, and conclusion.
1. **Ignite Talk:** You will present the talk on 5/11, 5/12, or 5/13. Details to come for Ignite Talk guidelines.

[Final Project Guidelines](https://docs.google.com/document/d/1Zhmm9JP4FGQBi5abFiM22e5iXYo_rr7i_vbpW0-xt8A/edit)

## PreSumm

**Source**:

Code: https://github.com/nlpyang/PreSumm/

Paper: https://arxiv.org/abs/1908.08345

#### Dependencies

**Libraries**: 

Torch 1.1.0 (download instructions from https://pytorch.org/get-started/previous-versions/)

fastNLP (to install use ```pip install fastNLP```)

pyrouge (to install use ```pip install pyrouge```)

pytorch-transformers (to install use ```pip install pytorch-transformers```; needed to import `BertTokenizer` from `others.tokenization`)

rouge (to install use ```pip install rouge```)

transformers

```
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
```

**Stanford CoreNLP**

We use Stanford CoreNLP to tokenize the data. Download it [here](https://stanfordnlp.github.io/CoreNLP/) and unzip it. Then add the following line to your `.bash_profile`:
```
export CLASSPATH=/path/to/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar
```
replacing `/path/to/` with the path to where you saved the `stanford-corenlp-full-2017-06-09` directory. 
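
Before running the tokenization step, it can help to confirm that the `CLASSPATH` setting above is visible to the notebook process. The following is a minimal sketch, not part of PreSumm: the helper name `find_corenlp_jars` is hypothetical, and it only assumes that the jar file name starts with `stanford-corenlp`, as in the export line above.

```python
import os


def find_corenlp_jars(classpath=None):
    """Return the CLASSPATH entries that look like Stanford CoreNLP jars.

    Hypothetical helper: it only inspects file names and does not check
    that the jar exists or that Java is installed.
    """
    if classpath is None:
        classpath = os.environ.get("CLASSPATH", "")
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return [
        entry for entry in entries
        if os.path.basename(entry).startswith("stanford-corenlp")
        and entry.endswith(".jar")
    ]


jars = find_corenlp_jars()
if not jars:
    print("No Stanford CoreNLP jar found on CLASSPATH; tokenization will likely fail.")
```

If the list is empty inside Jupyter but the variable is set in your shell, the notebook server was probably started before the profile was updated and needs a restart.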

In [1]:
# Basic Python Libraries
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import timeit

plt.style.use("ggplot")

In [2]:
# Project Python Libraries
import json
import lxml
import lzma
import pickle
import re
import subprocess
import torch

from bs4 import BeautifulSoup
from pandas.io.json import json_normalize
from pyrouge import Rouge155
from others.tokenization import BertTokenizer
from torch import nn
from torch.nn import init
from transformers import BertModel



## Read data

In [4]:
base_path = "./data/text"
state = 'north_carolina.xz'
with lzma.open(os.path.join(base_path, state), "rb") as f:
    state_data = f.readlines()
data_json = [json.loads(line) for line in state_data]
print(f'Flattening data for {state}')
data = json_normalize(data_json)

Flattening data for north_carolina.xz


  


In [5]:
data['decision_date_p'] = pd.to_datetime(data.decision_date, errors='coerce')
data['decision_year'] = data.decision_date_p.dt.year

## Tokenize Data

In [6]:
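def tokenize(raw_path, save_path):
    """Tokenize and sentence-split every file in raw_path with Stanford CoreNLP, writing JSON to save_path."""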
    stories_dir = os.path.abspath(raw_path)
    tokenized_stories_dir = os.path.abspath(save_path)

    print("Preparing to tokenize %s to %s..." % (stories_dir, tokenized_stories_dir))
    stories = os.listdir(stories_dir)
    # make IO list file
    print("Making list of files to tokenize...")
    with open("mapping_for_corenlp.txt", "w") as f:
        for s in stories:
            f.write("%s\n" % (os.path.join(stories_dir, s)))
    command = ['java', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit',
               '-ssplit.newlineIsSentenceBreak', 'always', '-filelist', 'mapping_for_corenlp.txt', '-outputFormat',
               'json', '-outputDirectory', tokenized_stories_dir]
    print("Tokenizing %i files in %s and saving in %s..." % (len(stories), stories_dir, tokenized_stories_dir))
    subprocess.call(command)
    print("Stanford CoreNLP Tokenizer has finished.")
    os.remove("mapping_for_corenlp.txt")

In [7]:
sample_data = data.iloc[:10]

In [11]:
for row in sample_data.iterrows():
    caseid = row[1].id
    markup = row[1]['casebody.data']
    # The "xml" feature requires the lxml package (pip install lxml).
    soup = BeautifulSoup(markup, "xml")
    opinion = soup.find_all('opinion')[0]
    opinion_text = opinion.getText()
    headnotes = ' '.join([headnote.getText() for headnote in soup.find_all('headnotes')])
    
    with open(f'presumm_data/parsed_text/opinions/{caseid}.txt','w',encoding='utf-8') as f:
        f.write(opinion_text)
    
    with open(f'presumm_data/parsed_text/headnotes/{caseid}.txt','w',encoding='utf-8') as f:
        f.write(headnotes)

FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?

In [None]:
parsed_opinions_path = 'presumm_data/parsed_text/opinions'
tokenized_opinions_path = 'presumm_data/tokenized_text/opinions'
tokenize(parsed_opinions_path, tokenized_opinions_path)

In [None]:
parsed_headnotes_path = 'presumm_data/parsed_text/headnotes'
tokenized_headnotes_path = 'presumm_data/tokenized_text/headnotes'
tokenize(parsed_headnotes_path, tokenized_headnotes_path)

## Converting to JSON

In [None]:
REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}",
         "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'}


def clean(x):
    return re.sub(
        r"-lrb-|-rrb-|-lcb-|-rcb-|-lsb-|-rsb-|``|''",
        lambda m: REMAP.get(m.group()), x)

def load_json(case_id):
    """Load the tokenized opinion (source) and headnotes (target) of one case."""
    source = []
    tgt = []
    source_path = os.path.join('presumm_data/tokenized_text/opinions', f'{case_id}.txt.json')
    target_path = os.path.join('presumm_data/tokenized_text/headnotes', f'{case_id}.txt.json')
    # Use context managers so the file handles are closed after reading.
    with open(source_path, encoding='utf-8') as f:
        for sent in json.load(f)['sentences']:
            tokens = [t['word'].lower() for t in sent['tokens']]
            source.append(tokens)
    with open(target_path, encoding='utf-8') as f:
        for sent in json.load(f)['sentences']:
            tokens = [t['word'].lower() for t in sent['tokens']]
            tgt.append(tokens)

    # Map Penn Treebank bracket and quote tokens back to plain symbols.
    source = [clean(' '.join(sent)).split() for sent in source]
    tgt = [clean(' '.join(sent)).split() for sent in tgt]
    return source, tgt
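
To make the conversion step concrete, here is a small standalone sketch of what happens to one CoreNLP sentence. It is an illustration under assumptions: the names `PTB_REMAP`, `restore_symbols`, and `sentences_from_corenlp` are hypothetical, the mapping is repeated only so the block runs on its own, and `toy_doc` is made-up data that keeps just the `sentences`, `tokens`, and `word` fields that the loader reads.

```python
import re

# Same bracket/quote mapping as in the cell above, repeated so this runs standalone.
PTB_REMAP = {"-lrb-": "(", "-rrb-": ")", "-lcb-": "{", "-rcb-": "}",
             "-lsb-": "[", "-rsb-": "]", "``": '"', "''": '"'}
PTB_PATTERN = re.compile("|".join(re.escape(key) for key in PTB_REMAP))


def restore_symbols(text):
    """Replace lowercased Penn Treebank escapes with the symbols they stand for."""
    return PTB_PATTERN.sub(lambda m: PTB_REMAP[m.group()], text)


def sentences_from_corenlp(doc):
    """Turn a CoreNLP-style JSON dict into lowercased, cleaned token lists."""
    sentences = []
    for sent in doc["sentences"]:
        words = [token["word"].lower() for token in sent["tokens"]]
        sentences.append(restore_symbols(" ".join(words)).split())
    return sentences


# Made-up example: one sentence whose parentheses were escaped by the tokenizer.
toy_doc = {"sentences": [{"tokens": [
    {"word": "The"}, {"word": "Court"}, {"word": "-LRB-"}, {"word": "per"},
    {"word": "curiam"}, {"word": "-RRB-"}, {"word": "affirmed"}, {"word": "."},
]}]}

print(sentences_from_corenlp(toy_doc))
# [['the', 'court', '(', 'per', 'curiam', ')', 'affirmed', '.']]
```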

### Greedy Selection

In [None]:
import re

def _get_ngrams(n, text):
    """Calculates n-grams.

    Args:
      n: which n-grams to calculate
      text: An array of tokens

    Returns:
      A set of n-grams
    """
    ngram_set = set()
    text_length = len(text)
    max_index_ngram_start = text_length - n
    for i in range(max_index_ngram_start + 1):
        ngram_set.add(tuple(text[i:i + n]))
    return ngram_set


def _get_word_ngrams(n, sentences):
    """Calculates word n-grams for multiple sentences.
    """
    assert len(sentences) > 0
    assert n > 0

    # words = _split_into_words(sentences)

    words = sum(sentences, [])
    # words = [w for w in words if w not in stopwords]
    return _get_ngrams(n, words)


def cal_rouge(evaluated_ngrams, reference_ngrams):
    reference_count = len(reference_ngrams)
    evaluated_count = len(evaluated_ngrams)

    overlapping_ngrams = evaluated_ngrams.intersection(reference_ngrams)
    overlapping_count = len(overlapping_ngrams)

    if evaluated_count == 0:
        precision = 0.0
    else:
        precision = overlapping_count / evaluated_count

    if reference_count == 0:
        recall = 0.0
    else:
        recall = overlapping_count / reference_count

    f1_score = 2.0 * ((precision * recall) / (precision + recall + 1e-8))
    return {"f": f1_score, "p": precision, "r": recall}


def greedy_selection(doc_sent_list, abstract_sent_list, summary_size):
    def _rouge_clean(s):
        return re.sub(r'[^a-zA-Z0-9 ]', '', s)
   
    max_rouge = 0.0
    abstract = sum(abstract_sent_list, [])
    #abstract = abstract_sent_list
    abstract = _rouge_clean(' '.join(abstract)).split()
    sents = [_rouge_clean(' '.join(s)).split() for s in doc_sent_list]
    evaluated_1grams = [_get_word_ngrams(1, [sent]) for sent in sents]
    #print(evaluated_1grams)
    reference_1grams = _get_word_ngrams(1, [abstract])
    evaluated_2grams = [_get_word_ngrams(2, [sent]) for sent in sents]
    reference_2grams = _get_word_ngrams(2, [abstract])

    selected = []

    for s in range(summary_size):
        cur_max_rouge = max_rouge
        cur_id = -1
        
        for i in range(len(sents)):
            if (i in selected):
                continue
                
            c = selected + [i]
            candidates_1 = [evaluated_1grams[idx] for idx in c]
            candidates_1 = set.union(*map(set, candidates_1))
            candidates_2 = [evaluated_2grams[idx] for idx in c]
            candidates_2 = set.union(*map(set, candidates_2))
            rouge_1 = cal_rouge(candidates_1, reference_1grams)['f']
            rouge_2 = cal_rouge(candidates_2, reference_2grams)['f']
            rouge_score = rouge_1 + rouge_2           
            if rouge_score > cur_max_rouge:
                cur_max_rouge = rouge_score
                cur_id = i
        if (cur_id == -1):
            return sorted(selected)
        selected.append(cur_id)
        max_rouge = cur_max_rouge
    
    
    return sorted(selected)
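
The greedy procedure above builds extractive labels: it repeatedly adds the document sentence that most increases the sum of unigram and bigram F1 overlap with the reference, and stops when no sentence helps. The sketch below is a condensed, standalone re-implementation for illustration only. The names `ngrams`, `f1`, and `greedy_oracle` are hypothetical, the punctuation stripping is omitted, and the three sentences and the reference are made-up data.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def f1(candidate, reference):
    """F1 overlap between two n-gram sets (same smoothing constant as above)."""
    overlap = len(candidate & reference)
    precision = overlap / len(candidate) if candidate else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return 2.0 * precision * recall / (precision + recall + 1e-8)


def greedy_oracle(doc_sents, ref_tokens, summary_size):
    """Greedily pick sentence indices that maximize ROUGE-1 F1 + ROUGE-2 F1."""
    unigrams = [ngrams(s, 1) for s in doc_sents]
    bigrams = [ngrams(s, 2) for s in doc_sents]
    ref_uni, ref_bi = ngrams(ref_tokens, 1), ngrams(ref_tokens, 2)
    selected, best = [], 0.0
    for _ in range(summary_size):
        best_id = -1
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            chosen = selected + [i]
            cand_uni = set().union(*(unigrams[j] for j in chosen))
            cand_bi = set().union(*(bigrams[j] for j in chosen))
            score = f1(cand_uni, ref_uni) + f1(cand_bi, ref_bi)
            if score > best:
                best, best_id = score, i
        if best_id == -1:  # no remaining sentence improves the score
            break
        selected.append(best_id)
    return sorted(selected)


doc = [
    "the defendant appealed the judgment".split(),
    "the weather was pleasant that day".split(),
    "the indictment for murder was defective".split(),
]
reference = "indictment for murder was defective and the defendant appealed".split()

print(greedy_oracle(doc, reference, 2))  # [0, 2]
```

The third sentence is picked first because it shares the most unigrams and bigrams with the reference. The first sentence is added next because it covers the remaining reference words, while the unrelated second sentence is never selected.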

### Bert Data

In [None]:
max_src_nsents = 10000
class BertData():
    def __init__(self, min_src_ntokens_per_sent=5,
                max_src_ntokens_per_sent=200,
                max_src_nsents=max_src_nsents,
                min_src_nsents=1,
                max_tgt_ntokens=500,
                min_tgt_ntokens=5):
        self.min_src_ntokens_per_sent = min_src_ntokens_per_sent
        self.max_src_ntokens_per_sent = max_src_ntokens_per_sent
        self.max_src_nsents = max_src_nsents
        self.min_src_nsents = min_src_nsents
        self.max_tgt_ntokens = max_tgt_ntokens
        self.min_tgt_ntokens = min_tgt_ntokens
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.tgt_bos = '[unused0]'
        self.tgt_eos = '[unused1]'
        self.tgt_sent_split = '[unused2]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]

    def preprocess(self, src, tgt, sent_labels, use_bert_basic_tokenizer=False, is_test=False):

        if ((not is_test) and len(src) == 0):
            return None

        original_src_txt = [' '.join(s) for s in src]

        idxs = [i for i, s in enumerate(src) if (len(s) > self.min_src_ntokens_per_sent)]

        _sent_labels = [0] * len(src)
        for l in sent_labels:
            _sent_labels[l] = 1

        src = [src[i][:self.max_src_ntokens_per_sent] for i in idxs]
        sent_labels = [_sent_labels[i] for i in idxs]
        src = src[:self.max_src_nsents]
        sent_labels = sent_labels[:self.max_src_nsents]

        if ((not is_test) and len(src) < self.min_src_nsents):
            return None

        src_txt = [' '.join(sent) for sent in src]
        text = ' {} {} '.format(self.sep_token, self.cls_token).join(src_txt)

        src_subtokens = self.tokenizer.tokenize(text)

        src_subtokens = [self.cls_token] + src_subtokens + [self.sep_token]
        src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
        segments_ids = []
        for i, s in enumerate(segs):
            if (i % 2 == 0):
                segments_ids += s * [0]
            else:
                segments_ids += s * [1]
        cls_ids = [i for i, t in enumerate(src_subtoken_idxs) if t == self.cls_vid]
        sent_labels = sent_labels[:len(cls_ids)]

        tgt_subtokens_str = '[unused0] ' + ' [unused2] '.join(
            [' '.join(self.tokenizer.tokenize(' '.join(tt), use_bert_basic_tokenizer=use_bert_basic_tokenizer)) for tt in tgt]) + ' [unused1]'
        tgt_subtoken = tgt_subtokens_str.split()[:self.max_tgt_ntokens]
        if ((not is_test) and len(tgt_subtoken) < self.min_tgt_ntokens):
            return None

        tgt_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(tgt_subtoken)

        tgt_txt = '<q>'.join([' '.join(tt) for tt in tgt])
        src_txt = [original_src_txt[i] for i in idxs]

        return src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt
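
The least obvious part of `preprocess` is how it derives the interval segment ids and the `[CLS]` positions from the flat list of token ids. The following standalone sketch isolates that logic on a toy input. The function names are hypothetical; 101 and 102 are the `[CLS]` and `[SEP]` ids of the `bert-base-uncased` vocabulary, and the other ids are made-up placeholders.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary


def interval_segments(token_ids, sep_id=SEP_ID):
    """Assign alternating segment ids (0, 1, 0, ...) to consecutive sentences.

    Each sentence spans from its [CLS] token up to and including its [SEP].
    """
    boundaries = [-1] + [i for i, t in enumerate(token_ids) if t == sep_id]
    lengths = [boundaries[i] - boundaries[i - 1] for i in range(1, len(boundaries))]
    segment_ids = []
    for sent_idx, length in enumerate(lengths):
        segment_ids += [sent_idx % 2] * length
    return segment_ids


def cls_positions(token_ids, cls_id=CLS_ID):
    """Return the index of every [CLS] token, one per sentence."""
    return [i for i, t in enumerate(token_ids) if t == cls_id]


# Three toy "sentences" of 2, 1, and 3 placeholder tokens.
toy_ids = [101, 7, 8, 102, 101, 9, 102, 101, 5, 6, 4, 102]

print(interval_segments(toy_ids))  # [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
print(cls_positions(toy_ids))      # [0, 4, 7]
```

The `[CLS]` positions are what the extractive model reads one sentence representation from, which is why `sent_labels` is truncated to the number of `[CLS]` tokens.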

In [None]:
datasets = []
case_ids = ['1268383','11272108','11272573','11272694','11273468',
            '11273534','11274033','11274050','11645357','11956941']
bert = BertData()  # build once; reloading the tokenizer for every case is unnecessary
for case_id in case_ids:
    source, tgt = load_json(case_id)
    sent_labels = greedy_selection(source[:max_src_nsents], tgt, 5)
    source = [' '.join(s).lower().split() for s in source]
    tgt = [' '.join(s).lower().split() for s in tgt]
    b_data = bert.preprocess(source, tgt, sent_labels, use_bert_basic_tokenizer=True,
                             is_test=False)
    if b_data is not None:
        src_subtoken_idxs, sent_labels, tgt_subtoken_idxs, segments_ids, cls_ids, src_txt, tgt_txt = b_data
        b_data_dict = {"src": src_subtoken_idxs, "tgt": tgt_subtoken_idxs,
                               "src_sent_labels": sent_labels, "segs": segments_ids, 'clss': cls_ids,
                               'src_txt': src_txt, "tgt_txt": tgt_txt}
        datasets.append(b_data_dict)

In [15]:
len(datasets)

8

In [14]:
datasets[0]

{'src': [101, 1037, 3484, 1997, 1996, 2457, 1010, 3568, 1010, 2108, 1997, 1996,
  3732, 5448, 1010, 1996, 3021, 1997, 24265, 2001, 8793, 6453, 3085, 1024,
  1008, 21950, 8663, 3366, 15417, 2135, 1010, 2588, 2009, 6251, 1997, 2331,
  2064, 2025, 8945, 2979, 1010, 1012, 102, 101, 1996, 10684, 1997, 1996,
  7173, 2383, 2363, 1037, 10210, 3775, 7606, 2000, 9279, 1996, 7267, 1010,
  2002, 2097, 1997, 2607, 3961, 1999, 7173, 2127, 2255, 2744, 1997, 1996,
  6020, 2457, 1010, 2043, 1037, 2047, 3021, 2097, 2022, 4567, 1010, 1998,
  2178, 3979, 2097, 2202, 2173, 1012, 102],
 'tgt': [1, 4028, 1012, 3, 1999, 2019, 24265, 2005, 4028, 1996, 3091, 1998,
  5995, 1997, 1996, 6357, 2442, 2022, 5228, 1012, 3, 1016, 9881, 1012,
  3, 1038, 1012, 1016, 1010, 1039, 1012, 3943, 1010, 1

In [None]:
torch.save(datasets, 'presumm_data/bert_sample.pt')

In [3]:
datasets = torch.load('presumm_data/bert_sample.pt')

In [None]:
%%bash
python get_candidate.py --tokenizer=bert --data_path=/path/to/your_original_data.jsonl \
    --index_path=/path/to/your_index.jsonl --write_path=/path/to/store/your_processed_data.jsonl

In [4]:
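# torch.load unpickles the full model object, so the module that defines
# its class (here, `model`) must be importable from the notebook.
model = torch.load('MatchSum_cnndm_model/MatchSum_cnndm_bert.ckpt')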

ModuleNotFoundError: No module named 'model'