## Convert JSON files to TXT files

Someone has already preprocessed the original PDF files of the business reports, so we start from human-readable JSON files, which is convenient. However, the JSON files contain a lot of information we do not need. We therefore extract only the useful content from each JSON file and write it to a TXT file, which speeds up later processing.

In [2]:
#import packages
import os
import json

#### 1. Configure file path

Folder Structure:
    - Text_Mining_Lab (root folder)
        - /json_data (original json files)
        - /txt_data (converted txt files)
        - /spacy_corpus (preprocessed files, dumped by pickle library)
        - /executable_notebooks
            - /01-..ipynb
            ...
            - /07-..ipynb
            - /non_german_files.txt
        - /bank_quarterly_report (sample of reports to test LDA model)

Please keep your folder structure consistent with ours. To begin with, make sure you have the **json_data** folder, which contains all JSON files for the business reports, and the **executable_notebooks** folder, which holds your executable Jupyter notebooks. **To minimize the effort of modifying file paths, try to use relative paths.**

**non_german_files.txt**: a list of file names of business reports that are written in English. These files should be removed to avoid confusion.

Please copy **non_german_files.txt** from */home/bit/ma0/LabShare/data/non_german_files.txt* to the *executable_notebooks* folder.
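The exclusion list can be applied with a small helper. The following is a minimal sketch, not the notebook's actual filtering step: it assumes that **non_german_files.txt** holds one file name per line, with or without an extension, and the names `load_excluded_names` and `filter_german_files` are hypothetical.

```python
import os


def load_excluded_names(list_path):
    """Read the exclusion list, assuming one file name per line.

    Blank lines are ignored. The one-name-per-line format is an
    assumption; adjust the parsing if the list is laid out differently.
    """
    with open(list_path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def filter_german_files(file_names, excluded_names):
    """Keep only the files that do not appear in the exclusion list.

    A file is dropped if either its full name or its name without the
    extension is listed.
    """
    kept = []
    for name in file_names:
        stem = os.path.splitext(name)[0]
        if name not in excluded_names and stem not in excluded_names:
            kept.append(name)
    return kept
```

A call such as `filter_german_files(os.listdir('../json_data/'), load_excluded_names('non_german_files.txt'))` would then return only the reports that are not on the list.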

In [3]:
#please modify the path

#absolute path to json dataset
ABSOLUTE_SOURCE_PATH = '/home/bit/ma0/LabShare/data/chui_ma/json_data/'
ABSOLUTE_TARGET_PATH = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/' 

#relative path to json dataset
RELATIVE_SOURCE_PATH = '../json_data/'
RELATIVE_TARGET_PATH = '../txt_data/'

#### 2. Extract paragraphs from JSON files and write into TXT file

From the JSON files, extract only the *content* of entries whose type is *paragraph*. Meanwhile, a few variables monitor the file conversion:
- successful_files: list, stores all files that were successfully read and converted
- failure_files: list, stores all files that could not be read or converted
- progress: int, keeps track of the number of files that have been processed
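The extraction rule can be tried on a few in-memory lines before running it over the whole dataset. This is a minimal sketch that assumes each line of a JSON file is a separate JSON object with a `type` and a `content` key; the helper name `extract_paragraphs` and the sample records are made up for illustration.

```python
import json


def extract_paragraphs(lines):
    """Return the 'content' of every record whose 'type' is 'paragraph'.

    Lines that are not valid JSON, or that lack the expected keys,
    are skipped.
    """
    paragraphs = []
    for line in lines:
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            # Not a valid JSON record: skip it
            continue
        if isinstance(obj, dict) and obj.get('type') == 'paragraph' and 'content' in obj:
            paragraphs.append(obj['content'])
    return paragraphs


# Hypothetical sample records, one JSON object per line
sample_lines = [
    '{"type": "heading", "content": "Lagebericht"}',
    '{"type": "paragraph", "content": "Der Umsatz stieg."}',
    'not valid json',
    '{"type": "paragraph", "content": "Die Kosten sanken."}',
]
print(extract_paragraphs(sample_lines))
```

Only the two paragraph records survive; the heading and the malformed line are dropped.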

In [4]:
%%time
#keep track of the success/failure of files
successful_files = []
failure_files = []

#monitor the file processing
progress = 0

#create target folder to save processed text (no error if it already exists)
os.makedirs(RELATIVE_TARGET_PATH, exist_ok=True)

for root, dirs, files in os.walk(RELATIVE_SOURCE_PATH):
    file_count = len(files)
    for f in files:
        progress += 1
        #target file keeps the source name without the '.json' suffix
        filename = RELATIVE_TARGET_PATH + f.split('.json')[0]
        try:
            with open(os.path.join(root, f), 'r') as freader, open(filename, 'w') as fwriter:
                #only extract 'content' of type 'paragraph' from json file
                for line in freader:
                    try:
                        obj = json.loads(line)
                        if obj['type'] == 'paragraph':
                            fwriter.write(obj['content'])
                    except (ValueError, KeyError, TypeError):
                        #skip lines that are not valid JSON or lack the expected keys
                        continue
            successful_files.append(filename)
        except Exception:
            #the file could not be opened, read or written
            failure_files.append(filename)
        if progress % 200 == 0:
            print('{0}% of total files have been processed'.format(round(progress / file_count * 100, 2)))
        if progress == file_count:
            print('all files have been processed')
        

CPU times: user 246 µs, sys: 338 µs, total: 584 µs
Wall time: 291 µs


Let's quickly check how many files failed to convert.

In [5]:
#quick check on file success/failure
print('\n{} files successfully processed. {} files failed.'.format(len(successful_files), len(failure_files)))


0 files successfully processed. 0 files failed.
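
A count of 0 successes and 0 failures means the loop never saw a file. One possible cause is a source path that does not exist, because `os.walk` yields nothing for a missing directory instead of raising an error. The sketch below is one way to check this up front; the helper name `count_json_files` is hypothetical and not part of the original notebook.

```python
import os


def count_json_files(source_dir):
    """Count the '.json' files below source_dir.

    Raises FileNotFoundError when the folder is missing, since os.walk
    would otherwise silently yield nothing.
    """
    if not os.path.isdir(source_dir):
        raise FileNotFoundError(
            'source folder not found: ' + os.path.abspath(source_dir))
    total = 0
    for _root, _dirs, files in os.walk(source_dir):
        total += sum(1 for name in files if name.endswith('.json'))
    return total
```

Calling `count_json_files(RELATIVE_SOURCE_PATH)` before the conversion shows whether the relative path points at the expected dataset.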
