## Convert JSON files to TXT files

Someone has already preprocessed the original PDF business reports, so what we have are human-readable JSON files, which is convenient. However, they contain a lot of information we do not need. We therefore extract only the useful information from each JSON file and write it to a TXT file, which speeds up later processing.

In [1]:
#import packages
import os
import json

**Make sure all data files are extracted into the json_data folder**

In [2]:
# absolute path
ABSOLUTE_SOURCE_PATH = '/home/bit/ma0/LabShare/data/chui_ma/json_data/'
ABSOLUTE_TARGET_PATH = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/'

# relative path
RELATIVE_SOURCE_PATH = '../json_data/'
RELATIVE_TARGET_PATH = '../txt_data/'

#### 1. Extract paragraphs from JSON files and write into TXT file

From the JSON files, extract only the *content* of objects whose type is *paragraph*. Meanwhile, set several variables to monitor the file conversion.
- successful_files: list, stores all files that were read and extracted successfully
- failure_files: list, stores all files that failed to be read or extracted
- progress: int, keeps track of the number of files that have been processed
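
The filtering step can be tried on a small in-memory sample before running it on the real data. This is a minimal sketch, assuming each line of a source file is one JSON object with `type` and `content` keys; the sample lines and the helper name `paragraphs` are made up for illustration.

```python
import json

# Hypothetical sample: one JSON object per line, as the extraction code assumes.
sample_lines = [
    '{"type": "heading", "content": "Annual Report"}',
    '{"type": "paragraph", "content": "Revenue grew in 2017."}',
    'not valid json',
    '{"type": "paragraph", "content": "Costs stayed flat."}',
]

def paragraphs(lines):
    """Return the 'content' of every line that parses as a paragraph object."""
    kept = []
    for line in lines:
        try:
            obj = json.loads(line)
            if obj['type'] == 'paragraph':
                kept.append(obj['content'])
        except (ValueError, KeyError, TypeError):
            # malformed lines and objects without the expected keys are skipped
            continue
    return kept

result = paragraphs(sample_lines)
print(result)
```

Only the two paragraph objects survive; the heading and the malformed line are dropped without raising an error.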

In [3]:
def extractContent(source, target):
    """Write the 'content' of every paragraph object in source to target.

    source is expected to hold one JSON object per line.
    """
    with open(source, 'r') as freader, open(target, 'w') as fwriter:
        for line in freader:
            try:
                obj = json.loads(line)
                # only keep 'content' of objects whose type is 'paragraph'
                if obj['type'] == 'paragraph':
                    fwriter.write(obj['content'])
            except (ValueError, KeyError, TypeError):
                # skip lines that are not valid JSON or lack the expected fields
                continue

In [4]:
%%time
# keep track of the success/failure of files
successful_files = []
failure_files = []

# create the target folder for the processed text if it does not exist yet
os.makedirs(RELATIVE_TARGET_PATH, exist_ok=True)

# collect all source files first so the total is known across subfolders
sources = [os.path.join(root, f)
           for root, dirs, files in os.walk(RELATIVE_SOURCE_PATH)
           for f in files]
file_count = len(sources)

# monitor the file process
for progress, source in enumerate(sources, start=1):
    # the target keeps the base name of the source file, without '.json'
    filename = RELATIVE_TARGET_PATH + os.path.basename(source).split('.json')[0]
    try:
        extractContent(source, filename)
        successful_files.append(filename)
    except Exception:
        failure_files.append(filename)
    if progress % 200 == 0:
        print('{:6.2%} of the total files have been processed'.format(
            progress/file_count), end='\r')

# pad the message so it overwrites the progress line left by '\r'
print('All files have been processed'.ljust(50))

All files have been processed
CPU times: user 33.4 s, sys: 1.93 s, total: 35.3 s
Wall time: 36.6 s


In [5]:
#quick check on file success/failure
print('{} files successfully processed. {} files failed.'.format(len(successful_files), len(failure_files)))

1655 files successfully processed. 0 files failed.
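
A file counts as successful whenever no exception was raised, so a source file without any paragraph objects still produces an empty output file. The sketch below shows one way to spot such files; the helper `find_empty_files` and the temporary demo files are hypothetical and not part of the original notebook.

```python
import os
import tempfile

def find_empty_files(paths):
    """Return the paths whose file size is zero bytes."""
    return [p for p in paths if os.path.getsize(p) == 0]

# Self-contained demo: two temporary files stand in for converted output.
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, 'report_a')
    empty = os.path.join(tmp, 'report_b')
    with open(full, 'w') as fh:
        fh.write('Revenue grew in 2017.')
    open(empty, 'w').close()
    empty_files = find_empty_files([full, empty])
    empty_names = [os.path.basename(p) for p in empty_files]

print(empty_names)
```

In the notebook itself, the same check could be run as `find_empty_files(successful_files)`.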


After conversion, the next step is file [Preprocessing](02-Preprocessing.ipynb).