### Corpus Generation

After preprocessing, we generate a text corpus for each document. Each corpus is a list of words that may contain duplicates.
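To make the idea of a word list with duplicates concrete, here is a minimal sketch. The `toy_tokenize` helper is a hypothetical stand-in for the real preprocessing function from 02-Preprocessing.ipynb; it only lowercases the text and splits it on whitespace.

```python
from collections import Counter


def toy_tokenize(text):
    """Hypothetical stand-in for the real preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


corpus = toy_tokenize("The cat saw the other cat")

# The corpus is a plain list, so repeated words are kept in their original order
print(corpus)

# Term counts, which a later TF-IDF step relies on, follow directly from the duplicates
counts = Counter(corpus)
print(counts["the"], counts["cat"], counts["saw"])
```

Keeping the duplicates matters because term frequencies cannot be recovered from a set of unique words.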

<font color="blue"/>

### dsp:
  * You mean "potentially duplicated words"? The list may as well contain words that are not duplicated 😉

#### 1. Run 02-Preprocessing.ipynb first, so that the preprocessing function can be invoked

In [2]:
%run 02-Preprocessing.ipynb

CPU times: user 303 ms, sys: 3.94 ms, total: 307 ms
Wall time: 78.7 ms
CPU times: user 1.37 s, sys: 75.1 ms, total: 1.45 s
Wall time: 534 ms


In [3]:
import os
import pickle
import time

#### 2. Generate Corpus for each document

- source_folder: folder containing the pre-extracted TXT files (/txt_data folder)
- corpus_folder: folder in which the corpus for each document is saved (/spacy_corpus folder)
- preprocessor: function that turns a text string into a list of tokens; defaults to `spacyPreprocessing` (spaCy preprocessing). Pass a different function, for example an NLTK-based one, to change the preprocessing.
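Any callable that maps a string to a list of strings can be passed as the `preprocessor` argument of `generateCorpus`. The sketch below shows one such function; `regex_preprocessing` is a hypothetical example and not part of 02-Preprocessing.ipynb.

```python
import re


def regex_preprocessing(text):
    """Hypothetical preprocessor: lowercase, keep alphabetic tokens of at least two letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


tokens = regex_preprocessing("Gene A binds gene B; gene expression rises.")
print(tokens)
```

It could then be passed in place of the default, for instance `generateCorpus(source_folder, corpus_folder, regex_preprocessing)`.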

In [4]:
def generateCorpus(source_folder, corpus_folder, preprocessor=spacyPreprocessing):
    os.makedirs(corpus_folder, exist_ok=True)
    # number of files handled so far, used for the progress report
    progress_count = 0

    for root, dirs, files in os.walk(source_folder):
        file_count = len(files)
        for file in files:
            progress_count += 1
            # stays empty if reading or preprocessing fails
            tokens = []

            try:
                with open(os.path.join(root, file)) as fr:
                    text = fr.read()
                    tokens = preprocessor(text)
            except Exception as error:
                print('Error while processing file: ', file, error)

            # the corpus file keeps the source file name;
            # an empty list is stored if the file could not be processed
            with open(os.path.join(corpus_folder, file), 'wb') as fw:
                pickle.dump(tokens, fw)

            if progress_count % 200 == 0:
                print('{:6.2%} of the total files have been processed'.format(
                    progress_count/file_count), end='\r')
                # NOTE: this break stops the loop after the first 200 files;
                # remove it to process the whole folder
                break

            if progress_count == file_count:
                print('all files have been processed')
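
Each corpus file written by `generateCorpus` is a pickled Python list, so it can be read back with `pickle.load`. The following sketch shows the round trip in a temporary folder; `load_corpus` is a hypothetical helper and the file name and tokens are made up for illustration.

```python
import os
import pickle
import tempfile


def load_corpus(corpus_folder, file_name):
    """Hypothetical helper: read back one pickled token list."""
    with open(os.path.join(corpus_folder, file_name), 'rb') as fr:
        return pickle.load(fr)


# Round trip in a temporary folder, mimicking what generateCorpus writes
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'doc1.txt'), 'wb') as fw:
        pickle.dump(['gene', 'protein', 'gene'], fw)
    restored = load_corpus(tmp, 'doc1.txt')

print(restored)
```

Note that the stored files keep the name of the source file, so a name ending in `.txt` holds pickled binary data and must be opened in `'rb'` mode.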

<font color="blue"/>

### dsp:
  * The names `target_corpus` and `source_folder` seem to indicate that the first is some object/data structure while the second is a folder. That's confusing. Suggestion: Rename `target_corpus` to `corpus_folder`. Or maybe you find something even better.
  * For me the parameter order 1) source 2) target feels more natural, but this might depend on the context.
  * Flag parameters are always worth a second look. In your case your code depends on NLTK even when you do not use it. If adding another preprocessing method is a realistic option, you would need to change the type of `spacyFlag`. You might consider passing the actual preprocessing function (signature: `(str) -> list(str)`) as a parameter, like `def generateCorpus(source_folder, corpus_folder, preprocessor=spacy_preprocessing):` and later `tokens = preprocessor(text)`.
  * What is the difference between `progress` and `progress_count`?
  * Are you sure you want to store the tokens even when there was an exception?
  * Call `f` `file`. My eyes needed to jump around to understand what `f` stands for.
  * With `os.makedirs(name, exist_ok=True)` you don't need the existence check. (But that's a matter of taste.)
  * The comment `#use pickle library to dump list to file` before `pickle.dump(tokens, fw)` just states the obvious to me.
  * "all files has" ~> "all files have"
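
One way to address the question about storing tokens after an exception is to skip the write for failed files and report them at the end. This is only a sketch under that assumption, not the notebook's implementation; `generate_corpus_skipping_failures` and `picky_preprocessor` are hypothetical names.

```python
import os
import pickle
import tempfile


def generate_corpus_skipping_failures(source_folder, corpus_folder, preprocessor):
    """Hypothetical variant: write a corpus file only when preprocessing succeeded."""
    os.makedirs(corpus_folder, exist_ok=True)
    failed = []
    for root, _dirs, files in os.walk(source_folder):
        for file in files:
            try:
                with open(os.path.join(root, file)) as fr:
                    tokens = preprocessor(fr.read())
            except Exception as error:
                # remember the failure and do not write anything for this file
                failed.append((file, repr(error)))
                continue
            with open(os.path.join(corpus_folder, file), 'wb') as fw:
                pickle.dump(tokens, fw)
    return failed


def picky_preprocessor(text):
    """Hypothetical preprocessor that fails on some inputs."""
    if 'bad' in text:
        raise ValueError('cannot parse')
    return text.split()


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'txt')
    target = os.path.join(tmp, 'corpus')
    os.makedirs(source)
    for name, text in [('ok.txt', 'a b a'), ('broken.txt', 'bad input')]:
        with open(os.path.join(source, name), 'w') as fw:
            fw.write(text)
    failed = generate_corpus_skipping_failures(source, target, picky_preprocessor)
    written = sorted(os.listdir(target))

print(written)
print([name for name, _ in failed])
```

Returning the list of failures keeps the decision about what to do with them (retry, log, ignore) with the caller.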

In [5]:
%%time
#please modify the path
# corpus_folder = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
# source_folder = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/'

#relative folder path
corpus_folder = '../spacy_corpus/'
source_folder = '../txt_data/'

# use spacyPreprocessing
generateCorpus(source_folder, corpus_folder, spacyPreprocessing)

12.08% of the total files have been processed
CPU times: user 10min 24s, sys: 44.1 s, total: 11min 8s
Wall time: 4min 37s


After generating the corpus, we would like to run some experiments on [TF-IDF](04-TF-IDF_Raw_Implementation.ipynb).

<font color="blue"/>

### dsp:
  * Either concentrate the variables that need modification in a separate file or at the beginning of the Notebook.
  * The hint to "Convert JSON to TXT" would probably also be more natural at the beginning of the Notebook, as you did in the next Notebook.
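
The first suggestion could look roughly like the sketch below, placed in the first cell of the Notebook or in a small separate module. The constant names and the `CORPUS_DATA_ROOT` environment variable are assumptions made for illustration; the defaults mirror the relative paths used above.

```python
import os

# Hypothetical: a single place for everything a user has to adapt.
# CORPUS_DATA_ROOT is an assumed environment variable, not an existing convention.
DATA_ROOT = os.environ.get('CORPUS_DATA_ROOT', '..')

# The trailing empty component makes os.path.join append a path separator,
# matching the '../txt_data/' style used in the cell above
SOURCE_FOLDER = os.path.join(DATA_ROOT, 'txt_data', '')
CORPUS_FOLDER = os.path.join(DATA_ROOT, 'spacy_corpus', '')

print(SOURCE_FOLDER)
print(CORPUS_FOLDER)
```

With this in place, switching between the shared and the relative data location would only require changing one value.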