### Corpus Generation

After preprocessing, we generate a text corpus for each document, which consists of a list of duplicated words.

<font color="blue"/>

### dsp:
  * You mean "potentially duplicated words"? The list may also contain words that are not duplicated &#x1f609;
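
To illustrate the point: a per-document corpus is a token list in which each word appears once per occurrence, so some words repeat and others do not. The sketch below uses a plain whitespace split as a stand-in. `toy_preprocessing` is a hypothetical helper for illustration only, not the actual `spacyPreprocessing` or `nltkPreprocessing` from 02-Preprocessing.ipynb.

```python
from collections import Counter


def toy_preprocessing(text):
    """Hypothetical stand-in tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


tokens = toy_preprocessing("The cat chased the other cat")
counts = Counter(tokens)

# Words that occur more than once vs. exactly once in the token list
repeated = sorted(word for word, n in counts.items() if n > 1)
unique = sorted(word for word, n in counts.items() if n == 1)

print(tokens)
print("repeated:", repeated)
print("unique:", unique)
```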

#### 1. Run 02-Preprocessing.ipynb first, to make sure the preprocessing functions can be invoked

In [30]:
%run 02-Preprocessing.ipynb

In [31]:
import os
import pickle
import time

#### 2. Generate Corpus for each document

- target_corpus: folder in which the corpus for each document is saved (/spacy_corpus folder)
- source_folder: folder containing the pre-extracted TXT files (/txt_data folder)
- spacyFlag: defaults to True; if True, spaCy preprocessing is used, otherwise NLTK preprocessing

In [None]:
def generateCorpus(target_corpus, source_folder, spacyFlag=True):
    if not os.path.exists(target_corpus):
        os.makedirs(target_corpus)
    
    #monitor the file process
    progress = 0
    progress_count = 0
    
    start_time = time.process_time()

    for root, dirs, files in os.walk(source_folder):
        file_count = len(files)
        for f in files:
            progress += 1
            progress_count += 1
            tokens = []
            
            try:
                with open(root+'/'+f) as fr:
                    text = fr.read()
                    if spacyFlag == False:
                        tokens = nltkPreprocessing(text)
                    if spacyFlag == True:
                        tokens = spacyPreprocessing(text)
            except:
                print('Error while processing file: ', f)
            
            
            with open(target_corpus+f, 'wb') as fw:
                #use pickle library to dump list to file
                pickle.dump(tokens, fw)

            if progress_count % 400 == 0:
                print('{0}% of total files have been processed'.format(round(progress/file_count*100, 2)))
                print('running time: {0}'.format(time.process_time() - start_time))

            if progress_count == file_count:
                print('all files has been processed')

<font color="blue"/>

### dsp:
  * The names `target_corpus` and `source_folder` seem to indicate that the first is some object/data structure while the second is a folder. That's confusing. Suggestion: Rename `target_corpus` to `corpus_folder`. Or maybe you find something even better.
  * For me the parameter order 1) source 2) target feels more natural, but this might depend on the context.
  * Flag parameters are always worth a second look. In your case, your code depends on NLTK even when you do not use it. If adding another preprocessing option is a realistic possibility, you would need to change the type of `spacyFlag`. You might consider passing the actual preprocessing function (signature: `(str) -> list[str]`) as a parameter, like `def generateCorpus(source_folder, corpus_folder, preprocessor=spacy_preprocessing):`, and later calling `tokens = preprocessor(text)`.
  * What is the difference between `progress` and `progress_count`?
  * Are you sure you want to store the tokens even when there was an exception?
  * Call `f` `file`. My eyes needed to jump around to understand what `f` stands for.
  * With `os.makedirs(name, exist_ok=True)` you don't need the existence check. (But that's a matter of taste.)
  * The comment `#use pickle library to dump list to file` before `pickle.dump(tokens, fw)` just states the obvious to me.
  * "all files has" ~> "all files have"
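
Taken together, the suggestions above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a drop-in replacement: `generate_corpus` and `whitespace_preprocessing` are hypothetical names, the placeholder tokenizer stands in for the real spaCy or NLTK function, input files are assumed to be UTF-8, and the progress reporting is left out for brevity.

```python
import os
import pickle
import tempfile


def whitespace_preprocessing(text):
    """Hypothetical placeholder with the signature (str) -> list of str."""
    return text.lower().split()


def generate_corpus(source_folder, corpus_folder, preprocessor=whitespace_preprocessing):
    """Tokenize every file under source_folder and pickle the tokens into corpus_folder.

    Returns the names of the files that could not be processed.
    """
    os.makedirs(corpus_folder, exist_ok=True)
    failed = []
    for root, _dirs, files in os.walk(source_folder):
        for file in files:
            try:
                # Assumption: the TXT files are UTF-8 encoded
                with open(os.path.join(root, file), encoding="utf-8") as fr:
                    tokens = preprocessor(fr.read())
            except Exception as err:
                print("Error while processing file:", file, err)
                failed.append(file)
                continue  # do not store tokens for a file that failed
            with open(os.path.join(corpus_folder, file), "wb") as fw:
                pickle.dump(tokens, fw)
    return failed


# Small demonstration on temporary folders
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "txt_data")
    target = os.path.join(tmp, "corpus")
    os.makedirs(source)
    with open(os.path.join(source, "doc1.txt"), "w", encoding="utf-8") as fw:
        fw.write("Hello corpus hello")
    failed = generate_corpus(source, target)
    with open(os.path.join(target, "doc1.txt"), "rb") as fr:
        loaded = pickle.load(fr)

print(loaded)
print(failed)
```

Passing the function instead of a flag means a third preprocessing variant only needs a new callable, and the notebook no longer has to import a library it does not use.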

Before running the snippets below, make sure the TXT files are prepared. If not, go to [Convert JSON to TXT](01-Convert_JSON_to_TXT.ipynb).

In [None]:
%%time
#please modify the path
# target_corpus = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
# source_folder = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/'

#relative folder path
target_corpus = '../spacy_corpus/'
source_folder = '../txt_data/'

generateCorpus(target_corpus, source_folder)

<font color="blue"/>

### dsp:
  * Concentrate the variables that need modification either in a separate file or at the beginning of the notebook.
  * The hint to "Convert JSON to TXT" would probably also be more natural at the beginning of the notebook, as you did in the next notebook.
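
One possible way to collect the settings in a single place is a small configuration cell at the top of the notebook. The names `DATA_ROOT`, `SOURCE_FOLDER`, and `CORPUS_FOLDER` are hypothetical; the relative paths are the ones used in the cell above.

```python
from pathlib import Path

# Hypothetical configuration cell: the only place a user needs to edit.
DATA_ROOT = Path("..")                      # assumed base directory
SOURCE_FOLDER = DATA_ROOT / "txt_data"      # pre-extracted TXT files
CORPUS_FOLDER = DATA_ROOT / "spacy_corpus"  # pickled token lists

print(SOURCE_FOLDER)
print(CORPUS_FOLDER)
```

Since the current `generateCorpus` builds file names by string concatenation, the paths would need to be passed as strings with a trailing slash, for example `str(CORPUS_FOLDER) + '/'`.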