In [1]:
text = "The Pacific Economic Cooperation Council (PECC) has issued a report examining the economic impact of the COVID-19 crisis on the Asia-Pacific region and outlining an agenda for cooperation.The technical and policy-focused publication titled, ‘State of the Region: Special report on COVID-19,’ acknowledges high levels of policy uncertainty in the region due to the pandemic, as well as the scale and duration of related economic and social shocks, and pace of recovery. The report thus focuses on how regional cooperation can provide governments with more options for recovery in the face of these uncertainties.The report collates a set of proposals using data collected from 710 survey respondents in business, academia, government, and civil society between 9 May and 12 June 2020. The results show greater levels of pessimism on economic impacts to the region than official estimates indicate. The report points to regional stimulus efforts totaling approximately USD 5.4 trillion, and notes that while policymakers’ appetites are constrained by memories of recent financial crises, “regional economies have space for further stimulus.”Regional mechanisms, the authors emphasize, can facilitate the design and implementation of coordination and cooperation packages, and build a sense of direction to support future growth. They note that top priorities for regional cooperation include sharing pandemic preparedness practices, vaccine development, and three aspects of trade with respect to essential products: the facilitation of trade as a whole; the removal of export restrictions; and the removal of tariffs.Observing that the pandemic has both deepened and accelerated preexisting trends, the report notes the importance of human contact, but also opportunities around digital technology and the multitude of connections available. As schools remain shuttered in many parts of the world, the report notes remote learning opportunities, with the caveat that risks remain around the digital divide despite action being taken through the Asia-Pacific Economic Cooperation (APEC) Internet and Digital Roadmap.To address the “first order” priorities stemming from the pandemic, the report recommends multilateral actions that facilitate regional progress on seven issue areas:Information sharing;Flow of essential products;Moving beyond gross domestic product (GDP);Facilitating e-commerce;Restarting travel;Minimizing disruption to supply chains; andContact tracing.A message from PECC co-chairs Don Campbell and Su Ge highlights that this year marks PECC’s 40th anniversary and that the rationale behind the Council’s establishment still rings true today. They note that “[w]e can only resolve this pandemic and economic crisis through effective cooperation.” [Publication:State of the Region Report: Impact of the Covid‐19 Crisis] [Publication Landing Page]"

In [43]:
text

'The Pacific Economic Cooperation Council (PECC) has issued a report examining the economic impact of the COVID-19 crisis on the Asia-Pacific region and outlining an agenda for cooperation.The technical and policy-focused publication titled, ‘State of the Region: Special report on COVID-19,’ acknowledges\xa0high levels of policy uncertainty in the region due to the pandemic,\xa0as well as the scale and duration of related economic and social shocks, and pace of recovery. The report thus focuses on how regional cooperation can provide governments with more options for recovery in the face of these uncertainties.The report collates a set of proposals using data collected from 710 survey respondents in business, academia, government, and civil society between 9 May and 12 June 2020. The results show greater levels of pessimism on economic impacts to the region than official estimates indicate. The report points to regional stimulus efforts totaling approximately USD 5.4 trillion, and note

In [83]:
import re
import nltk
from transformers import AutoTokenizer
import torch


class SentenceProcessor():
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained('roberta-base')
        self.CLS_TOKEN = self.tokenizer.cls_token_id
        self.PAD_TOKEN = self.tokenizer.pad_token_id
        self.SEP_TOKEN = self.tokenizer.sep_token_id
        
        self.num_prev_sentences = 2
        self.max_block_size = 510
        
    def preprocess(self, text):
        # clean the text by removing unnecessary characters
        text = re.sub(r'\s+', ' ', text)
        
        # split text into sentences
        sentences = nltk.tokenize.sent_tokenize(text)
        
        # convert sentences into tokens
        sentences_ids = [self.tokenizer.encode_plus(
                                    sentence, 
                                    add_special_tokens=False, 
                                    padding=False, 
                                    truncation=False,
                                    return_token_type_ids=False,
                                    return_attention_mask=False,
                                    return_tensors='pt',
                                    verbose=False)['input_ids'].tolist()[0] for sentence in sentences]
        
        ### split tokens into blocks of 512 tokens ###
        
        input_ids_blocks = []
        attention_blocks = []
        
        def add_to_input_ids_blocks(input_ids):
            assert len(input_ids) <= self.max_block_size
            
            pad_len = self.max_block_size - len(input_ids)
            
            input_ids.insert(0, self.CLS_TOKEN)
            input_ids.append(self.SEP_TOKEN)
            
            attention_mask = [1] * len(input_ids)
            
            # add padding length to make all the blocks the same size
            if pad_len > 0:
                input_ids.extend([self.PAD_TOKEN] * pad_len)
                attention_mask.extend([0] * pad_len)
            
            input_ids_blocks.append(input_ids)
            attention_blocks.append(attention_mask)
        
        
        current_input_ids = []
        for i, sentence_ids in enumerate(sentences_ids):
            if (len(current_input_ids) + len(sentence_ids) <= self.max_block_size):
                # if current block has enough space for the current sentence
                current_input_ids.extend(sentence_ids)
            else:
                # if the current block doesn't have enough space for the current sentence
                
                # clear the current block
                if current_input_ids:
                    add_to_input_ids_blocks(current_input_ids)
                    current_input_ids = []
                
                # if the sentence is too long to be less than max token size, trucate it
                if not (len(sentence_ids) <= self.max_block_size):
                    current_input_ids = sentence_ids[:self.max_block_size]
                    add_to_input_ids_blocks(current_input_ids)
                    current_input_ids = []
                    continue
                
                current_input_ids.extend(sentence_ids)
                
                # add the previous sentences to the current block if it is less than 512 tokens
                for j in range(min(self.num_prev_sentences, i)):
                    prev_sentence = sentences_ids[i-j-1]
                    if len(current_input_ids) + len(prev_sentence) <= self.max_block_size:
                        current_input_ids[:0] = prev_sentence
                    else:
                        # retain some of the first previous sentence for context learning
                        if j == 0:
                            diff = self.max_block_size - len(current_input_ids)
                            current_input_ids[:0] = prev_sentence[len(prev_sentence) - diff:len(prev_sentence)]
                            add_to_input_ids_blocks(current_input_ids)
                            current_input_ids = []
                        break
            
            if i == len(sentences_ids) - 1 and current_input_ids:
                add_to_input_ids_blocks(current_input_ids)
                current_input_ids = []
        
        return {
            'input_ids': torch.IntTensor(input_ids_blocks),
            'attention_mask': torch.IntTensor(attention_blocks)
        }
        
    
sprocessor = SentenceProcessor()
print(sprocessor.preprocess(text)['input_ids'].shape)

torch.Size([2, 512])
