## Convert JSON files to TXT files

Someone has already preprocessed the original PDF files of the business reports, so what we received are human-readable JSON files, which is quite nice. Nevertheless, they contain a lot of information that is useless for our purposes. Therefore, we decided to extract the useful information from each JSON file and write it to a TXT file, in order to speed up later processing.

In [7]:
import os
cwd = os.getcwd()
cwd

'/Users/xiaoqi/Github_Project/feedback'

In [1]:
#import packages
import os
import json

#### 1. Configure file path

Folder Structure:
    - Text_Mining_Lab (root folder)
        - /json_data (original json files)
        - /txt_data (converted txt files)
        - /spacy_corpus (preprocessed files, dumped by pickle library)
        - /executable_notebooks
            - /01-..ipynb
            ...
            - /07-..ipynb
            - /non_german_files.txt
        - /bank_quarterly_report (sample of reports to test LDA model)

Please stay consistent with our folder structure. To begin with, make sure you have the **json_data** folder, which contains all JSON files of the business reports, and the **executable_notebooks** folder, which holds your executable Jupyter notebooks. **To minimize the effort of modifying file paths, try to use relative paths.**

**non_german_files.txt**: a list of the names of files whose business reports are written in English; these files should be removed to avoid confusion.

Please copy the **non_german_files.txt** from */home/bit/ma0/LabShare/data/non_german_files.txt* to the *executable_notebooks* folder
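
As a rough sketch of how this list might be applied in a later step: the snippet below assumes the file holds one file name per line (the format is not documented here), and both `load_excluded_names` and `keep_german_files` are hypothetical helpers, not part of the notebooks.

```python
import os

def load_excluded_names(list_path):
    """Read the file names to exclude, one per line (assumed format)."""
    with open(list_path, 'r', encoding='utf-8') as reader:
        # ignore blank lines and surrounding whitespace
        return {line.strip() for line in reader if line.strip()}

def keep_german_files(filenames, excluded):
    """Drop every file listed in `excluded`, matched with or without extension."""
    return [name for name in filenames
            if name not in excluded
            and os.path.splitext(name)[0] not in excluded]
```

Matching both with and without the extension is a guess, since the converted files are written without the `.json` suffix.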

<font color="blue"/>

### dsp:
  * &#x1f642;  Nice that you documented the expected folder structure so precisely.
  * It could be a bit more flexible. I didn't try it this time.

In [2]:
#please modify the path

#absolute path
ABSOLUTE_SOURCE_PATH = '/home/bit/ma0/LabShare/data/chui_ma/json_data/'
ABSOLUTE_TARGET_PATH = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/' 

#relative path
RELATIVE_SOURCE_PATH = '../json_data/'
RELATIVE_TARGET_PATH = '../txt_data/'

#### 2. Extract paragraphs from JSON files and write into TXT file

From the JSON files, extract only the useful *content* whose type is *paragraph*. Meanwhile, set several variables to monitor the file conversion.
- successful_files: list, stores all files that were read and converted successfully
- failure_files: list, stores all files that could not be read or converted
- progress: int, keeps track of the number of files that have been processed

In [None]:
%%time
#keep track on the success/failure of files
successful_files = []
failure_files = []

#monitor the file process
progress = 0
progress_count = 0

if not os.path.exists(RELATIVE_TARGET_PATH):
    #create target folder to save processed text
    os.makedirs(RELATIVE_TARGET_PATH)
    
for root, dirs, files in os.walk(RELATIVE_SOURCE_PATH):
    file_count = len(files)
    for f in files:
        progress += 1
        progress_count += 1
        try:
            filename = RELATIVE_TARGET_PATH + f.split('.json')[0]
            with open(root+'/'+f, 'r') as freader, open(filename, 'w') as fwriter:
                #only extract 'content' from json file
                for line in freader:
                    try:
                        obj = json.loads(line)
                        if obj['type'] == 'paragraph':
                            fwriter.write(obj['content'])
                    except Exception:
                        #skip JSON decode error
                        continue
            successful_files.append(filename)
        except Exception:
            failure_files.append(filename)
        if progress_count % 200 == 0:
            print('{:6.2%} of the total files has been processed'.format(progress/file_count), end='\r')
        if progress_count == file_count:
            print('all files has been processed')
        

<font color="blue"/>

### dsp:
  * &#x1f642; It is nice to get feedback during long running computations.
  * Two minor tips:
    * You may use '\r' to update the same line again and again.
    * Percentage computation is available as format specification. Thus your line could read:
```python
print('{:6.2%} of the total files has been processed'.format(progress/file_count), end='\r')
```
  * Limiting the time spent on output, and the amount of it, is often essential. I assume the reading here takes up enough time so that you could as well update the output for each file. (But I might be wrong.)
  * The nesting level in this cell is a bit high. You might consider extracting parts into a separate function, e.g. the "only extract 'content' from json file" part. But it is up to your judgment.
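
To illustrate the last point, here is one possible way to pull the inner loop into a function. It is only a sketch: `extract_paragraphs` is a hypothetical name, and the narrowed exception list and the returned count are additions that are not in the original cell.

```python
import json

def extract_paragraphs(source_path, target_path):
    """Write the 'content' of every 'paragraph' record from a line-delimited
    JSON file to a text file and return the number of paragraphs written."""
    written = 0
    with open(source_path, 'r') as freader, open(target_path, 'w') as fwriter:
        for line in freader:
            try:
                obj = json.loads(line)
                if obj['type'] == 'paragraph':
                    fwriter.write(obj['content'])
                    written += 1
            except (ValueError, KeyError, TypeError):
                # skip lines that are not valid JSON or lack the expected fields
                continue
    return written
```

The outer loop would then only call `extract_paragraphs(root + '/' + f, filename)` inside its `try` block, which removes three levels of nesting from the cell.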

In [4]:
#quick check on file success/failure
print('{} files successfully processed. {} files failed.'.format(len(successful_files), len(failure_files)))


4443 files successfully processed. 0 files failed.


<font color="blue"/>

### dsp:
  * The leading '\n' should be unnecessary here.
  * &#x1f642; Separating the conversion step from the other steps is very reasonable to me! Everything that follows now can work on any text file containing paragraphs as lines. Minor remarks:
    * Some other groups make use of further information that is in the JSON file.
    * You might want to extract the configuration information into a separate file, so that all notebooks that do not need to know of the JSON files do not need to know of this conversion Notebook as well.
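
A minimal sketch of such a configuration file, assuming it lives next to the notebooks in *executable_notebooks*; the module name `config.py` and the constant names are hypothetical:

```python
# config.py (hypothetical module placed in the executable_notebooks folder)
import os

# project root, one level above the notebooks folder
ROOT_PATH = '..'

# folder names taken from the documented folder structure
JSON_DATA_PATH = os.path.join(ROOT_PATH, 'json_data')
TXT_DATA_PATH = os.path.join(ROOT_PATH, 'txt_data')
SPACY_CORPUS_PATH = os.path.join(ROOT_PATH, 'spacy_corpus')
```

A later notebook could then run `from config import TXT_DATA_PATH` without touching the conversion notebook. Note that these paths carry no trailing slash, so file names should be appended with `os.path.join` rather than string concatenation.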
