
CNN/DM : data preprocessing #11

Closed
astariul opened this issue Oct 17, 2019 · 5 comments

@astariul

The link given for the CNN/DM data points to an already preprocessed dataset.

How can we reproduce a similar dataset from the official .story files?

@donglixp
Contributor

step 1: use CoreNLP to split the sentences
step 2: run BertTokenizer to obtain subword tokens
step 3: save the source text to *.src, and the target text to *.tgt
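
For illustration, here is a minimal sketch of those three steps. It is not the maintainers' actual preprocessing script: it assumes Stanford CoreNLP reached through stanza's CoreNLPClient, the transformers BertTokenizer, and placeholder file names such as example.story and train.src.

# Rough sketch of the three steps above -- not the maintainers' actual script.
# Assumes CORENLP_HOME points to a Stanford CoreNLP install (used via stanza's
# CoreNLPClient) and that the transformers library is available.
from stanza.server import CoreNLPClient
from transformers import BertTokenizer

with open('example.story', encoding='utf-8') as f:  # placeholder .story file
    story_text = f.read()
# In the official .story files the summary sentences follow "@highlight" markers;
# separating the article body from the highlights is omitted here.

# step 1: sentence splitting with CoreNLP
with CoreNLPClient(annotators=['tokenize', 'ssplit']) as client:
    ann = client.annotate(story_text)
    sentences = [' '.join(tok.word for tok in sent.token) for sent in ann.sentence]

# step 2: BERT subword tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
subword_sents = [tokenizer.tokenize(s) for s in sentences]

# step 3: write the source side; the target (*.tgt) side is produced the same
# way from the highlight sentences
with open('train.src', 'w', encoding='utf-8') as f_src:
    f_src.write(' '.join(' '.join(toks) for toks in subword_sents) + '\n')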

@tahmedge

@donglixp, can you please provide details on how to run these 3 steps?

@donglixp
Contributor

> @donglixp, can you please provide details on how to run these 3 steps?

# Imports added for completeness; the exact packages are a best guess
# (cytoolz also provides partition_all; pytorch_pretrained_bert also ships BertTokenizer).
import os
from multiprocessing import Pool

from nltk.tokenize.treebank import TreebankWordDetokenizer
from toolz import partition_all
from tqdm import tqdm
from transformers import BertTokenizer

# `args` is expected to come from argparse and to provide: bert_model,
# do_lower_case, processes, output_dir, and src_sep_token.


def process_detokenize(chunk):
    # Detokenize each pre-split sentence, then re-tokenize it into BERT subwords.
    twd = TreebankWordDetokenizer()
    tokenizer = BertTokenizer.from_pretrained(
        args.bert_model, do_lower_case=args.do_lower_case)
    r_list = []
    for idx, line in chunk:
        # Normalize PTB-style quotes before detokenizing.
        line = line.strip().replace('``', '"').replace("''", '"').replace('`', "'")
        s_list = [twd.detokenize(x.strip().split(' '), convert_parentheses=True)
                  for x in line.split('<S_SEP>')]
        tk_list = [tokenizer.tokenize(s) for s in s_list]
        r_list.append((idx, s_list, tk_list))
    return r_list


def read_tokenized_file(fn):
    # Read a file whose lines hold sentences separated by <S_SEP> and process
    # them in parallel, restoring the original line order at the end.
    with open(fn, 'r', encoding='utf-8') as f_in:
        l_list = [l for l in f_in]
    num_pool = min(args.processes, len(l_list))
    p = Pool(num_pool)
    chunk_list = partition_all(
        int(len(l_list) / num_pool), list(enumerate(l_list)))
    r_list = []
    with tqdm(total=len(l_list)) as pbar:
        for r in p.imap_unordered(process_detokenize, chunk_list):
            r_list.extend(r)
            pbar.update(len(r))
    p.close()
    p.join()
    r_list.sort(key=lambda x: x[0])
    # Returns (detokenized sentences, BERT subword tokens) for every line.
    return [x[1] for x in r_list], [x[2] for x in r_list]


def append_sep(s_list):
    # Insert [SEP_0] ... [SEP_9] markers between sentences (index capped at 9).
    r_list = []
    for i, s in enumerate(s_list):
        r_list.append(s)
        r_list.append('[SEP_{0}]'.format(min(9, i)))
    return r_list[:-1]


# Convert into src/tgt format. `article_tk`, `summary_tk`, `label`, and `split_out`
# come from earlier in the script: tokenized articles, tokenized summaries,
# per-sentence extraction labels, and the split name (train/dev/test).
with open(os.path.join(args.output_dir, split_out + '.src'), 'w', encoding='utf-8') as f_src, \
        open(os.path.join(args.output_dir, split_out + '.tgt'), 'w', encoding='utf-8') as f_tgt, \
        open(os.path.join(args.output_dir, split_out + '.slv'), 'w', encoding='utf-8') as f_slv:
    for src, tgt, lb in tqdm(zip(article_tk, summary_tk, label)):
        # source: subword tokens of each sentence, optionally joined with [SEP_i]
        src_tokenized = [' '.join(s) for s in src]
        if args.src_sep_token:
            f_src.write(' '.join(append_sep(src_tokenized)))
        else:
            f_src.write(' '.join(src_tokenized))
        f_src.write('\n')
        # target (silver): source sentences whose extraction label is set
        slv_tokenized = [s for s, extract_flag in zip(src_tokenized, lb)
                         if extract_flag]
        f_slv.write(' [X_SEP] '.join(slv_tokenized))
        f_slv.write('\n')
        # target (gold): the reference summary sentences
        f_tgt.write(' [X_SEP] '.join([' '.join(s) for s in tgt]))
        f_tgt.write('\n')

The input file should already be split into sentences, with the sentences of each line separated by "<S_SEP>".
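
To make the snippet above self-contained, here is a hypothetical driver showing how it might be wired up. The argument names, input file names, and the placeholder labels are assumptions for illustration, not part of the original script.

# Hypothetical glue code: assumes the functions above (read_tokenized_file, etc.)
# are defined in the same module. File names and label generation are placeholders.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--bert_model', default='bert-base-uncased')
parser.add_argument('--do_lower_case', action='store_true')
parser.add_argument('--processes', type=int, default=8)
parser.add_argument('--output_dir', default='cnndm_processed')
parser.add_argument('--src_sep_token', action='store_true')
args = parser.parse_args()

split_out = 'train'
# One document per line, sentences separated by <S_SEP>.
article_sent, article_tk = read_tokenized_file('train.article.txt')
summary_sent, summary_tk = read_tokenized_file('train.summary.txt')
# Per-document 0/1 flags marking which source sentences form the silver
# extractive target; how these labels are produced is not shown in this thread.
label = [[0] * len(sents) for sents in article_sent]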

@tahmedge

Thank you very much, @donglixp.

@ranjeetds

@tahmedge Did you use the above script? If so, could you please share your implementation?
