CNN/DM: data preprocessing #11
step 1: use CoreNLP to split the sentences

@donglixp, can you please provide details on how to run these three steps?
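For step 1, a minimal sketch of sentence splitting against a locally running CoreNLP server (the server address, file handling, and the <S_SEP> joiner are assumptions based on this thread; the server can be started with the stock edu.stanford.nlp.pipeline.StanfordCoreNLPServer class):

import json
import requests

def split_sentences(text, url='http://localhost:9000'):
    # tokenize + ssplit only; CoreNLP emits PTB-style tokens
    # (-LRB-/-RRB-, ``/'' quotes), which the script below undoes.
    props = {'annotators': 'tokenize,ssplit', 'outputFormat': 'json'}
    resp = requests.post(url, params={'properties': json.dumps(props)},
                         data=text.encode('utf-8'))
    sentences = [' '.join(tok['word'] for tok in sent['tokens'])
                 for sent in resp.json()['sentences']]
    # Join with <S_SEP>, the sentence separator the script below splits on.
    return ' <S_SEP> '.join(sentences)

Steps 2 (detokenize and WordPiece-tokenize) and 3 (convert into src/tgt format) correspond to the script below.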
import os
from multiprocessing import Pool

from nltk.tokenize.treebank import TreebankWordDetokenizer
from pytorch_pretrained_bert.tokenization import BertTokenizer
from toolz import partition_all
from tqdm import tqdm

# `args` is expected to come from the script's argparse setup (not shown).

def process_detokenize(chunk):
    twd = TreebankWordDetokenizer()
    tokenizer = BertTokenizer.from_pretrained(
        args.bert_model, do_lower_case=args.do_lower_case)
    r_list = []
    for idx, line in chunk:
        # Undo PTB-style quote escaping before detokenizing.
        line = line.strip().replace('``', '"').replace("''", '"').replace('`', "'")
        # One document per line; sentences are separated by <S_SEP>.
        s_list = [twd.detokenize(x.strip().split(' '), convert_parentheses=True)
                  for x in line.split('<S_SEP>')]
        tk_list = [tokenizer.tokenize(s) for s in s_list]
        r_list.append((idx, s_list, tk_list))
    return r_list
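# Illustration (hypothetical input, not part of the script): given the
# step-1 output line
#   '-LRB- CNN -RRB- the cat sat . <S_SEP> it purred .'
# process_detokenize([(0, line)]) yields detokenized sentences such as
# '(CNN) the cat sat.' together with their WordPiece token lists.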
def read_tokenized_file(fn):
    with open(fn, 'r', encoding='utf-8') as f_in:
        l_list = [l for l in f_in]
    num_pool = min(args.processes, len(l_list))
    p = Pool(num_pool)
    # Pair each line with its index so the original order can be
    # restored after imap_unordered returns chunks out of order.
    chunk_list = partition_all(
        int(len(l_list) / num_pool), list(enumerate(l_list)))
    r_list = []
    with tqdm(total=len(l_list)) as pbar:
        for r in p.imap_unordered(process_detokenize, chunk_list):
            r_list.extend(r)
            pbar.update(len(r))
    p.close()
    p.join()
    r_list.sort(key=lambda x: x[0])
    # Return (detokenized sentences, WordPiece tokens) per document.
    return [x[1] for x in r_list], [x[2] for x in r_list]
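# Note (illustration only): toolz.partition_all chunks a sequence into
# fixed-size tuples, with a possibly shorter final chunk, e.g.
#   list(partition_all(2, range(5)))  ->  [(0, 1), (2, 3), (4,)]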
def append_sep(s_list):
    # Interleave numbered separators; the index is capped at 9, so every
    # sentence from the tenth onward shares [SEP_9].
    r_list = []
    for i, s in enumerate(s_list):
        r_list.append(s)
        r_list.append('[SEP_{0}]'.format(min(9, i)))
    return r_list[:-1]  # drop the trailing separator
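# Example (illustration only):
#   append_sep(['a b', 'c d', 'e f'])
#   -> ['a b', '[SEP_0]', 'c d', '[SEP_1]', 'e f']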
# convert into src/tgt format
# article_tk, summary_tk come from read_tokenized_file(); `label` holds
# per-sentence extraction flags used to build the silver (.slv) target.
with open(os.path.join(args.output_dir, split_out + '.src'), 'w', encoding='utf-8') as f_src, \
        open(os.path.join(args.output_dir, split_out + '.tgt'), 'w', encoding='utf-8') as f_tgt, \
        open(os.path.join(args.output_dir, split_out + '.slv'), 'w', encoding='utf-8') as f_slv:
    for src, tgt, lb in tqdm(zip(article_tk, summary_tk, label)):
        # source: WordPiece tokens, one document per line
        src_tokenized = [' '.join(s) for s in src]
        if args.src_sep_token:
            f_src.write(' '.join(append_sep(src_tokenized)))
        else:
            f_src.write(' '.join(src_tokenized))
        f_src.write('\n')
        # target (silver): source sentences whose extraction flag is set
        slv_tokenized = [s for s, extract_flag in zip(src_tokenized, lb)
                         if extract_flag]
        f_slv.write(' [X_SEP] '.join(slv_tokenized))
        f_slv.write('\n')
        # target (gold): reference summary sentences joined by [X_SEP]
        f_tgt.write(' [X_SEP] '.join([' '.join(s) for s in tgt]))
        f_tgt.write('\n')

The input should have been split by "<S_SEP>".
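For completeness, a hypothetical driver for the snippet above; the flag names mirror the attributes the functions read from `args`, while the input file names, `split_out`, and the source of `label` are assumptions, since they are not shown in this thread:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--bert_model', default='bert-base-cased')
parser.add_argument('--do_lower_case', action='store_true')
parser.add_argument('--processes', type=int, default=8)
parser.add_argument('--src_sep_token', action='store_true')
parser.add_argument('--output_dir', default='.')
args = parser.parse_args()

split_out = 'train'  # assumed split name
article_sents, article_tk = read_tokenized_file('train.article.split')  # step-1 output
summary_sents, summary_tk = read_tokenized_file('train.summary.split')  # step-1 output
# `label` (one list of 0/1 extraction flags per article) must come from a
# separate sentence-labeling step not shown in this thread.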
Thank you very much @donglixp.
@tahmedge Did you use the above script? If so, could you please share your implementation?
The link to the CNN/DM data points to an already preprocessed dataset. How can we reproduce a similar dataset from the official .story files?
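For context, each official .story file contains the article text followed by '@highlight' markers, each introducing one reference-summary sentence. A minimal sketch for separating the two (the function name is hypothetical):

def read_story(path):
    # Article paragraphs come first; each '@highlight' line is followed
    # by one reference-summary sentence.
    article, highlights = [], []
    next_is_highlight = False
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            if line == '@highlight':
                next_is_highlight = True
            elif next_is_highlight:
                highlights.append(line)
                next_is_highlight = False
            else:
                article.append(line)
    return article, highlights

The article lines would then go through CoreNLP sentence splitting (step 1), and the highlights become the gold summary sentences.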