# Note: the file paths need to be revised to run this file

# Process the sketch engine corpus 
## Description
This file process the sketch engine corpus files, including the following procedures:
1. Take the first n-lines of the corpus file, save as an example file for an easy access to the data file text structure
2. Check the corpus files for encoding errors
3. Split the large data files into smaller one
4. Process the split files.  

## Requirements
The following packages are required:
1. pandas
2. tqdm
3. hanziconv
4. ChilectoUtility

__Note:__
1. Detailed description sections are above the corresponding code section.
2. Description is only provided for codes related to the first corpus.

## Initialize

The following section is required if the required package(s) is not located in system search path or the current folder.

In [8]:
import os
import sys
package_dir = '/home/projects/semmetrix/chilecto/code'
sys.path.insert(0, os.path.abspath(package_dir))

In [9]:
# import libs
from ChilectoUtility.gen import FileSplit
from ChilectoUtility.gen import rand_file_in_dir
from ChilectoUtility.processor import SketchengineProcessor

In [10]:
# set the file names and output directories 
corpus_file_1 = '../05_sketchengine/taiwan_corpus/chinese_taiwan.vert' # corpus data file
output_name_1 = '../05_sketchengine/taiwan_corpus/example_chinese_taiwan.vert' # example file
output_dir_1 = '../05_sketchengine/taiwan_corpus/split/' # directory of split files
processed_dir_1 = '../05_sketchengine/taiwan_corpus/processed/' # directory of processed files

corpus_file_2 = '../05_sketchengine/zhTenTen_corpus/zhTenTen.vert'
output_name_2 = '../05_sketchengine/zhTenTen_corpus/example_zhTenTen.vert'
output_dir_2 = '../05_sketchengine/zhTenTen_corpus/split/'
processed_dir_2 = '../05_sketchengine/zhTenTen_corpus/processed/'

## Get example file and check files

In [5]:
# initial FileSplit object
fs1 = FileSplit(corpus_file_1, output_dir_1) # data file name and the dir for split files are set at the initialization
fs2 = FileSplit(corpus_file_2, output_dir_2)

NameError: name 'FileSplit' is not defined

In [6]:
num_of_line = 10000 # get the first n line
fs1.split_head(num_of_line, output_name_1) # get the example file, file name is defined by output_name_1
fs2.split_head(num_of_line, output_name_2)

Grep the first 10000 line of ../05_sketchengine/taiwan_corpus/chinese_taiwan.vert
--->../05_sketchengine/taiwan_corpus/example_chinese_taiwan.vert is created.

Grep the first 10000 line of ../05_sketchengine/zhTenTen_corpus/zhTenTen.vert
--->../05_sketchengine/zhTenTen_corpus/example_zhTenTen.vert is created.



The FileSplit.check_file() function will try to read all lines of the data file. If errors are risen, the function will save the line number and error message to a log file. The log file is named as _CorpusFileName.log_, which is in the same folder as the corpus data file.

In [7]:
# Check the data file for file reading problem
# fs1.check_file()
# fs2.check_file() 
# This will take a long time. And you don't need to run it again

The data checking process found 300 lines with encoding errors in the first corpus, which corresponding to 300 words. 

At this moment, the repairing attempt has not been successful. Considering the number of errors is very small comparing to the total amount of words in the corpus, those lines will be omitted in future analyses. 

__Note__: those words will be replace by '?' by Python code while reading those lines.

## Split the data file

The large data file will be split into small files at the end of doc sections. Each split file contains a fixed number of doc sections, and the number can be changed while calling the function.

The function requires the *end_tag* setting which indicates where the file should be split. It should contain the entire line including the line break '\n'. 

In [10]:
# choose the split options
end_tag = '</doc>\n'# the end line of data section
num_of_tag = 500  # how many sections in one split file

To test the code, the example file generated before can be used instead of the full data file. The following section should be commented out while processing the original corpus data.

In [11]:
# using the example file for code test
fs1 = FileSplit(output_name_1, output_dir_1)
fs2 = FileSplit(output_name_1, output_dir_2)
# ------------------------------------

In [12]:
# set the split options in FileSplit objectes
fs1.set_split_tag(end_tag, num_of_tag) # set the split options in FileSplit object
fs2.set_split_tag(end_tag, num_of_tag*15)   # for zhTenTen corpus, the num_of_tag is 
                                            # increased to avoid too many split files

In [13]:
# split the file
fs1.split_tag() 
fs2.split_tag()

Splitting file ../05_sketchengine/taiwan_corpus/example_chinese_taiwan.vert by tag. In process ......
    1 files are created

Splitting file ../05_sketchengine/taiwan_corpus/example_chinese_taiwan.vert by tag. In process ......
    1 files are created



## Process the split files

The *SketchengineProcessor* class is used to process the sketch engine data files

In [15]:
# initialize the processor objects. 
sp_1 = SketchengineProcessor(output_dir_1, processed_dir_1)
sp_2 = SketchengineProcessor(output_dir_2, processed_dir_2)

In [16]:
# Run the process. The output values are time costed by processing each file.
sp_1.dir_process()
sp_2.dir_process()

Processing files in folder: ../05_sketchengine/taiwan_corpus/split/


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.17it/s]


    --> Processing finished.
Processing files in folder: ../05_sketchengine/zhTenTen_corpus/split/


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.14it/s]


    --> Processing finished.


[0.09132051467895508,
 4.991447925567627,
 4.457061052322388,
 4.447014093399048,
 3.795661687850952,
 0.5930333137512207]

## End of file