## Preprocessing and loading files

This file contains functions for preprocessing documents. Given a list of document names, the code first checks whether each document has already been preprocessed. If a document has not been preprocessed, the code preprocesses it and stores the result on disk. If it has, the code loads the stored version directly. Because preprocessing is expensive, this caching saves a lot of time.
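
The load-or-preprocess mechanism described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in this file: `load_or_preprocess`, `cache_dir`, the `clean_` prefix, and `fake_preprocess` are hypothetical names, and the stand-in preprocessing step just lowercases the file name.

```python
import json
import tempfile
from pathlib import Path


def load_or_preprocess(name, cache_dir, preprocess):
    """Return the cached result for `name`; compute and store it on a cache miss."""
    cache_file = Path(cache_dir) / ("clean_" + name)  # hypothetical prefix
    if cache_file.is_file():
        # Cache hit: read the stored result instead of recomputing it
        with open(cache_file, "r") as f:
            return json.load(f)
    # Cache miss: do the expensive work once and persist the result
    result = preprocess(name)
    with open(cache_file, "w") as f:
        json.dump(result, f)
    return result


calls = []  # records how often the expensive step actually runs


def fake_preprocess(name):
    # Stand-in for the real (slow) preprocessing step
    calls.append(name)
    return [name.lower()]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_preprocess("Report.json", tmp, fake_preprocess)
    second = load_or_preprocess("Report.json", tmp, fake_preprocess)
```

The second call finds the stored JSON file and returns it without invoking the preprocessing callback again.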

Preprocessing uses the spaCy library and its German language model.

* **file_utils.py** contains code for building a single JSON document from a list of JSON strings. The result is then parsed with **json.loads**.
* **configuration.py** defines the path prefixes used for loading and storing documents.
* The first part of this file loads the list of non-German documents. Because our program only processes German documents, non-German documents would degrade the preprocessing results and are filtered out first.
* Function **\_read\_content\_of\_paragraphs** reads a JSON document paragraph by paragraph.
* Function **read_page_and_par_Info** reads a document and returns the page number of each of its paragraphs.
* Function **\_preprocess\_paragraphs** preprocesses a document, removes useless words, and stores the clean result on disk.
* Function **\_load\_preprocessed\_paragraphs** loads a clean, preprocessed document from disk.
* Function **get_clean_documents** is the public interface. Other users should call only this function to get clean, preprocessed documents.
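
As a rough illustration of the input the two reader functions expect, the sketch below builds a tiny document in memory and extracts the same fields. The keys `type`, `content`, `pagenumber`, and `counter` come from the code in this file; the `heading` entry and the sample texts are made up for the example.

```python
import json

# A made-up raw document: a JSON list of parts, only some of which are paragraphs
raw = json.dumps([
    {"type": "heading", "content": "Bericht", "pagenumber": 1, "counter": 0},
    {"type": "paragraph", "content": "Erster Absatz.", "pagenumber": 1, "counter": 1},
    {"type": "paragraph", "content": "Zweiter Absatz.", "pagenumber": 2, "counter": 2},
])

parts = json.loads(raw)

# Paragraph texts, as collected paragraph by paragraph
paragraphs = [part["content"] for part in parts if part["type"] == "paragraph"]

# Page number and paragraph counter for each paragraph
page_info = [
    {"page": part["pagenumber"], "paragraph": part["counter"]}
    for part in parts
    if part["type"] == "paragraph"
]
```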

In [14]:
import spacy
from spacy.lang.de.stop_words import STOP_WORDS
from pathlib import Path
import json

In [15]:
############################################################################################################################
# src/configuration.py defines these paths:
#     FILE_PATH : where the code reads the raw JSON documents from.
#     CLEAN_FILE_PREFIX : prefix that distinguishes clean, preprocessed documents from raw documents.
#     NON_GERMAN_FILE_PATH : where the code reads the list of non-German files from.
#     SHARE_SPACE_FOLDER_PATH : where the code stores clean, preprocessed documents. All group members can access it.
############################################################################################################################
%run src/file_utils.py
%run src/configuration.py

In [16]:
# Load the list of non-German files from a text file (one document name per line).
# get_clean_documents uses this list to filter out all non-German files.
with open(NON_GERMAN_FILE_PATH) as f:
    # Skip blank lines so they do not turn into a bogus '.json' entry
    non_german_documents = [line.strip() + '.json' for line in f if line.strip()]

In [17]:
# Read a raw JSON document and return the text of each of its paragraphs
def _read_content_of_paragraphs(file_name):
    contents = []
    try:
        document_parts = json.loads(FileUtils.fix_json(file_name))
        for part in document_parts:
            if part['type'] == 'paragraph':
                contents.append(part['content'])
    except Exception:
        # Catch Exception rather than using a bare except, so Ctrl+C still interrupts
        print('Bad file: ' + file_name)
    return contents

In [18]:
###########################################################################################################################
# Given a document, this function returns the page number of each of its paragraphs.
# parameter :
#          file_name: the absolute path of a document.
# return :
#          contents: a list in which each entry is a dict holding a paragraph number ('paragraph')
#          and the page number that paragraph appears on ('page').
###########################################################################################################################
def read_page_and_par_Info(file_name):
    contents = []
   
    try:
        data = json.loads(FileUtils.fix_json(file_name))
        for item in data:
            if item['type'] == 'paragraph':
                contents.append({
                    'page': item['pagenumber'],
                    'paragraph': item['counter']
                })
    except Exception:
        # Report unreadable or malformed files instead of failing silently
        print('Bad file: ' + file_name)
    return contents

In [19]:
# Replacement dictionary: each key is a string to replace, and its value is the replacement.
replace_dict = dict()
replace_dict['gCO2'] = 'CO2'

# Meaningful combinations of letters and digits that we want to preserve.
digit_and_letter_combintaions = list()
digit_and_letter_combintaions.append('CO2')

# Preprocess one document with spaCy: remove stop words, numbers, symbols and punctuation,
# then lemmatize what remains.
# The clean document is then stored on disk.
# Note: the input document must include its full path (FILE_PATH + file name).
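def _preprocess_paragraphs(document):
    # "de_core_news_sm" is spaCy's small German model; the bare "de" shortcut only works in spaCy 2.x
    nlp = spacy.load("de_core_news_sm", disable=['parser', 'ner'])
    paragraphs = _read_content_of_paragraphs(document)
    # Derive the company name from the file name (the part before the first '-') so it can be
    # removed later; company names are useless and harmful for our approach
    company_name = document[len(FILE_PATH):].split('-')[0].lower()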
    lemmatized_paragraphs = []
    for paragraph in paragraphs:
        # Rejoin words hyphenated across line breaks, then flatten the remaining line breaks
        content_of_document = paragraph.replace('-\n','')
        content_of_document = content_of_document.replace('\n',' ')
    
        # Apply every substitution listed in replace_dict
        for replace_source, replace_target in replace_dict.items():
            content_of_document = content_of_document.replace(replace_source, 
                                                              replace_target)
    
        # Turn hyphens into spaces and delete the characters +*<>%/&$
        remove_char = str.maketrans('-',' ','+*<>%/&$')
        content_of_document = content_of_document.translate(remove_char)
    
        words = nlp(content_of_document)
        filtered_words = [word for word in words if word.lower_ not in STOP_WORDS] 
        filtered_words = [word for word in filtered_words if word.pos_ != 'NUM' and word.pos_ != 'SYM' and word.pos_ != 'PUNCT']
        filtered_words = [word for word in filtered_words if not word.is_digit]
        filtered_lemmata = [word.lemma_ for word in filtered_words]
    
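        final = []  
        for lemma in filtered_lemmata:
            # Drop lemmas that contain the company name (an empty name would match everything)
            if company_name and company_name in lemma.lower():
                continue
                
            if any(c.isdigit() for c in lemma):
                # Keep a lemma with digits only if it contains a whitelisted combination;
                # any() ensures the lemma is appended at most once
                if any(combination in lemma for combination in digit_and_letter_combintaions):
                    final.append(lemma)
            elif '.' not in lemma:
                # Drop lemmas that contain a dot
                final.append(lemma)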
        
        lemmatized_content = " ".join(final)
        lemmatized_paragraphs.append(lemmatized_content.lower())
        
    # Write the result to a file
    filename = CLEAN_FILE_PREFIX  + document[len(FILE_PATH):] # strip the file path, keep only the document's name
    filename = SHARE_SPACE_FOLDER_PATH + filename  # prepend the clean-data folder path
    with open(filename, 'w') as outfile:
        json.dump(lemmatized_paragraphs, outfile) # save the clean data as JSON, which is easy to read back
    
    return lemmatized_paragraphs
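
The string cleanup that runs before spaCy is applied can be tried in isolation. The sketch below repeats the same replacements on a made-up sentence; only the sample text is invented, while the `gCO2` replacement and the character set match the cell above.

```python
# Made-up input with a word hyphenated across a line break, a unit and some symbols
text = "Der CO2-Aus-\nstoss liegt bei 120 gCO2/km\nund ist <10% hoeher"

# Rejoin hyphenated words, then turn the remaining line breaks into spaces
text = text.replace('-\n', '').replace('\n', ' ')

# Same replacement as replace_dict: 'gCO2' becomes 'CO2'
text = text.replace('gCO2', 'CO2')

# Map '-' to a space and delete the characters +*<>%/&$
table = str.maketrans('-', ' ', '+*<>%/&$')
cleaned = text.translate(table)

print(cleaned)
```

Digits such as `120` are still present at this point; in the pipeline above they are removed later by the part-of-speech and `is_digit` filters.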

In [20]:
# This function loads clean paragraphs that have already been preprocessed and saved.
# Given the absolute path of a document, it returns a list in which each entry is one paragraph of that document.
def _load_preprocessed_paragraphs(document):
    with open(document, 'r') as f:
        datastore = json.load(f)
    return datastore

In [21]:
#################################################################################################################################
# Given a list of document names (without path), this function preprocesses the documents and returns
# the clean versions. The form of the return value depends on the parameter get_paragraph.
# parameters :
#     documents_list: a list of file-name strings for the documents to preprocess
#     get_paragraph: True to get each document as a list of paragraphs, False to get each document as one string
#     logging: default = False. If True, prints progress information, such as which documents have
#     already been preprocessed and which are skipped as non-German.
# return :
#     documents_clean: depends on the value of get_paragraph.
#                      If get_paragraph = False, a list of strings, where each string is a whole
#                      document.
#                      If get_paragraph = True, a list of lists of strings, where each string is a
#                      paragraph and each inner list makes up one document.
#     documents_clean_name: a list of strings, the name corresponding to each returned document
#################################################################################################################################

def get_clean_documents(documents_list, get_paragraph = False, logging=False):
    documents_clean = list()
    documents_clean_name = list()

    for document in documents_list:
        # First, skip documents that are on the non-German list
        if document in non_german_documents:
            if logging:
                print("File " + document + " is non-German, skipping it")
            continue 
        # Second, check whether this document has already been preprocessed
        my_file = Path(SHARE_SPACE_FOLDER_PATH + CLEAN_FILE_PREFIX + document)
        if my_file.is_file():
            # The clean version exists, so load it directly from disk
            if logging: 
                print(CLEAN_FILE_PREFIX + document + " has already been preprocessed")
            documents_clean_name.append(document)
            documents_clean.append(_load_preprocessed_paragraphs(SHARE_SPACE_FOLDER_PATH + CLEAN_FILE_PREFIX + document))
        else:
            # Otherwise preprocess the document now; the result is also saved to disk
            if logging:
                print("Preprocessing " + document)
            documents_clean_name.append(document)
            documents_clean.append(_preprocess_paragraphs(FILE_PATH + document))

    if not get_paragraph:
        # The caller wants whole reports, so join the paragraphs of each document into one string
        documents_clean = [" ".join(document) for document in documents_clean]
    return documents_clean, documents_clean_name
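
The two return shapes controlled by `get_paragraph` can be illustrated with in-memory data. The paragraph strings below are made up; the join mirrors what the function does when `get_paragraph` is False.

```python
# Shape returned with get_paragraph=True: one list of paragraph strings per document
per_paragraph = [
    ["erster absatz", "zweiter absatz"],  # document 1
    ["einziger absatz"],                  # document 2
]

# Shape returned with get_paragraph=False: one string per document
per_document = [" ".join(paragraphs) for paragraphs in per_paragraph]

print(per_document)
```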