### Corpus Generation

After preprocessing, we want to generate text corpus for each document, which consists of a list of potentially duplicated words.

#### 1. Run 02-Preprocessing.ipynb first, to make sure preprocessing function can be invoked

In [1]:
%run 02-Preprocessing.ipynb

CPU times: user 299 ms, sys: 3.53 ms, total: 303 ms
Wall time: 77.1 ms
CPU times: user 1.39 s, sys: 76.1 ms, total: 1.47 s
Wall time: 548 ms


In [2]:
import os
import pickle
import time

#### 2. Generate Coupus for each document

- source_folder: folder contain Pre-extracted TXT files (/txt_data folder)
- corpus_folder: folder to save the corpus for each document (/spacy_corpus folder)
- spacyFlag: default: True, if set as True, to use spacy preprocessing, otherwise using nltk preprocessing

In [3]:
def generateCorpus(source_folder, corpus_folder, preprocessor=spacyPreprocessing):
    os.makedirs(corpus_folder, exist_ok=True)
    # monitor the file process
    progress_count = 1

    start_time = time.process_time()

    for root, dirs, files in os.walk(source_folder):
        file_count = len(files)
        for file in files:
            progress_count += 1
            tokens = []

            try:
                with open(root+'/'+file) as fr:
                    text = fr.read()
                    tokens = preprocessor(text)
                
                with open(corpus_folder+file, 'wb') as fw:
                    pickle.dump(tokens, fw)
                    
            except:
                print('Error while processing file: ', file)

            if progress_count % 200 == 0:
                print('{:6.2%} of the total files have been processed'.format(
                    progress_count/file_count), end='\r')
            
            if progress_count == file_count:
                print('all files have been processed')

In [4]:
%%time
#please modify the path
# corpus_folder = '/home/bit/ma0/LabShare/data/chui_ma/spacy_corpus/'
# source_folder = '/home/bit/ma0/LabShare/data/chui_ma/txt_data/'

#relative folder path
corpus_folder = '../spacy_corpus/'
source_folder = '../txt_data/'

# use spacyPreprocessing
generateCorpus(source_folder, corpus_folder, spacyPreprocessing)

12.08% of the total files have been processedCPU times: user 10min 49s, sys: 46.8 s, total: 11min 36s
Wall time: 4min 36s


After generating corpus, we would like to do some expriments on [TF-IDF](04-TF-IDF_Raw_Implementation.ipynb)