### Corpus Generation

After preprocessing, we generate a text corpus for each document. Each corpus is a list of the document's preprocessed words, with duplicates kept.
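To make the shape of such a corpus concrete, here is a minimal sketch. The `toy_preprocess` function is a hypothetical stand-in for the real preprocessing functions from 02-Preprocessing.ipynb (it only lowercases and strips punctuation), so the tokens it produces are illustrative only.

```python
from collections import Counter

def toy_preprocess(text):
    # Hypothetical stand-in for the notebook's preprocessing functions:
    # split on whitespace, strip surrounding punctuation, lowercase.
    stripped = (w.strip('.,;:!?') for w in text.split())
    return [w.lower() for w in stripped if w]

doc = "The cat sat on the mat. The mat was warm."
tokens = toy_preprocess(doc)

# Duplicates are kept, so word frequencies can still be recovered later.
counts = Counter(tokens)
print(tokens)
print(counts.most_common(2))
```

Keeping duplicates means the list still carries frequency information, which later steps may need.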

#### 1. Run 02-Preprocessing.ipynb first so that the preprocessing functions can be invoked

In [30]:
%run 02-Preprocessing.ipynb

In [31]:
import os
import pickle
import time

#### 2. Generate corpus for each document

- target_corpus: folder in which to save the corpus for each document (/spacy_corpus folder)
- source_folder: folder containing the pre-extracted TXT files (/txt_data folder)
- spacyFlag: defaults to True; if True, spaCy preprocessing is used, otherwise NLTK preprocessing

In [4]:

def generateCorpus(target_corpus, source_folder, spacyFlag=True):
    # Create the output folder if it does not exist yet
    os.makedirs(target_corpus, exist_ok=True)

    # Number of files handled so far, used for progress reporting
    progress = 0

    # process_time() counts CPU time of this process, not wall-clock time
    start_time = time.process_time()

    for root, dirs, files in os.walk(source_folder):
        # Counted per directory, so the percentages assume a flat source folder
        file_count = len(files)
        for f in files:
            progress += 1
            tokens = []

            try:
                with open(os.path.join(root, f)) as fr:
                    text = fr.read()
                if spacyFlag:
                    tokens = spacyPreprocessing(text)
                else:
                    # Assumes 02-Preprocessing.ipynb names this function nltkPreprocessing
                    tokens = nltkPreprocessing(text)
            except Exception as e:
                # On failure, an empty token list is still written below
                print('Error while processing file: ', f, e)

            # Use pickle to dump the token list; the output keeps the source file name
            with open(os.path.join(target_corpus, f), 'wb') as fw:
                pickle.dump(tokens, fw)

            if progress % 400 == 0:
                print('{0}% of total files has been processed'.format(round(progress / file_count * 100)))
                print('running time: {0}'.format(time.process_time() - start_time))

            if progress == file_count:
                print('all files has been processed')
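
Each output file is a pickle of a token list, saved under the same name as its source file. A minimal sketch of reading one back follows. The helper name `load_corpus` and the file name `example.txt` are assumptions made for illustration, not part of the original notebook, and a temporary folder stands in for the real corpus folder.

```python
import os
import pickle
import tempfile

def load_corpus(corpus_folder, filename):
    # Hypothetical helper: read one pickled token list back from disk.
    with open(os.path.join(corpus_folder, filename), 'rb') as fr:
        return pickle.load(fr)

# Self-contained demo: write a small token list the same way
# generateCorpus does, then load it again.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'example.txt'), 'wb') as fw:
        pickle.dump(['corpus', 'token', 'token'], fw)
    restored = load_corpus(tmp, 'example.txt')

print(restored)
```

Note that the saved files keep the `.txt` name of their source even though their content is binary pickle data, so they must be opened in `'rb'` mode.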
                


Before running the cell below, make sure the TXT files have been prepared. If not, go to [Convert JSON to TXT](01-Convert_JSON_to_TXT.ipynb).

In [33]:
%%time
# modify these paths to match your environment
# target_corpus = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
# source_folder = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/'

#relative folder path
target_corpus = '../spacy_corpus/'
source_folder = '../txt_data/'

generateCorpus(target_corpus, source_folder)

9% of total files has been processed
running time: 2118.1351720000002
18% of total files has been processed
running time: 4186.865105999999
27% of total files has been processed
running time: 6081.383608
36% of total files has been processed
running time: 7842.223702000001
45% of total files has been processed
running time: 9657.111828
54% of total files has been processed
running time: 11528.062233999999
63% of total files has been processed
running time: 13594.633854000002
72% of total files has been processed
running time: 15608.362729999999
81% of total files has been processed
running time: 17494.233487999998
90% of total files has been processed
running time: 19390.351499999997
99% of total files has been processed
running time: 21255.462284
all files has been processed
CPU times: user 2h 58min 44s, sys: 19min 22s, total: 3h 18min 7s
Wall time: 7h 44min 47s
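
The "running time" values in the log come from `time.process_time()`, which counts CPU time of the current process, whereas `%%time` also reports wall-clock time. The two clocks measure different things, so they need not agree. A minimal sketch of the difference, using a short sleep and a small loop as stand-ins for waiting and computing:

```python
import time

cpu_start = time.process_time()    # CPU time used by this process
wall_start = time.perf_counter()   # elapsed wall-clock time

time.sleep(0.2)                              # waiting: wall time passes, almost no CPU used
total = sum(i * i for i in range(200_000))   # computing: uses both

cpu_elapsed = time.process_time() - cpu_start
wall_elapsed = time.perf_counter() - wall_start

print('CPU time:  {0:.3f}s'.format(cpu_elapsed))
print('Wall time: {0:.3f}s'.format(wall_elapsed))
```

If elapsed wall-clock time is what matters for progress reporting, `time.perf_counter()` may be the more suitable clock.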
