<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Setup---Debug" data-toc-modified-id="Setup---Debug-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Setup - Debug</a></span></li><li><span><a href="#Setup---Imports" data-toc-modified-id="Setup---Imports-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Setup - Imports</a></span></li><li><span><a href="#Setup---logging" data-toc-modified-id="Setup---logging-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Setup - logging</a></span></li><li><span><a href="#Setup---virtualenv-jupyter-kernel" data-toc-modified-id="Setup---virtualenv-jupyter-kernel-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Setup - virtualenv jupyter kernel</a></span></li><li><span><a href="#Setup---Initialize-Django" data-toc-modified-id="Setup---Initialize-Django-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Setup - Initialize Django</a></span></li><li><span><a href="#Setup---Initialize-LoggingHelper" data-toc-modified-id="Setup---Initialize-LoggingHelper-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Setup - Initialize LoggingHelper</a></span></li></ul></li><li><span><a href="#Find-articles-to-be-loaded" data-toc-modified-id="Find-articles-to-be-loaded-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Find articles to be loaded</a></span><ul class="toc-item"><li><span><a href="#Loading-setup---working-folder-paths" data-toc-modified-id="Loading-setup---working-folder-paths-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Loading setup - working folder paths</a></span></li><li><span><a href="#Uncompress-files" data-toc-modified-id="Uncompress-files-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Uncompress files</a></span></li><li><span><a href="#Work-with-uncompressed-files" data-toc-modified-id="Work-with-uncompressed-files-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Work with uncompressed files</a></span></li><li><span><a href="#load-a-file-into-memory." data-toc-modified-id="load-a-file-into-memory.-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>load a file into memory.</a></span></li><li><span><a href="#build-list-of-all-ObjectTypes" data-toc-modified-id="build-list-of-all-ObjectTypes-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>build list of all ObjectTypes</a></span></li></ul></li><li><span><a href="#TODO" data-toc-modified-id="TODO-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>TODO</a></span></li></ul></div>

# Introduction

- Back to [Table of Contents](#Table-of-Contents)

This is a notebook that expands on the OpenCalais code in the file `article_coding.py`, also in this folder.  It includes more sections on selecting publications you want to submit to OpenCalais as an example.  It is intended to be copied and re-used.

# Setup

- Back to [Table of Contents](#Table-of-Contents)

## Setup - Debug

- Back to [Table of Contents](#Table-of-Contents)

In [1]:
debug_flag = False

## Setup - Imports

- Back to [Table of Contents](#Table-of-Contents)

In [2]:
import datetime
import glob
import logging
import lxml
import os
import six
import xml
import xmltodict
import zipfile

## Setup - logging

- Back to [Table of Contents](#Table-of-Contents)

configure logging for this notebook's kernel (If you do not run this cell, you'll get the django application's logging configuration.

In [3]:
logging.basicConfig(
    level = logging.INFO,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = '/home/jonathanmorgan/logs/django-research-data_load-BostonGlobe.log.txt',
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)

## Setup - virtualenv jupyter kernel

- Back to [Table of Contents](#Table-of-Contents)

If you are using a virtualenv, make sure that you:

- have installed your virtualenv as a kernel.
- choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, need to get that activated somehow inside this notebook.  One option is to run `../dev/wsgi.py` in this notebook, to configure the python environment manually as if you had activated the `sourcenet` virtualenv.  To do this, you'd make a code cell that contains:

    %run ../dev/wsgi.py
    
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is.  I'd worry about collisions with the actual Python 3 kernel.  Better, one can install their virtualenv as a separate kernel.  Steps:

- activate your virtualenv:

        workon research

- in your virtualenv, install the package `ipykernel`.

        pip install ipykernel

- use the ipykernel python program to install the current environment as a kernel:

        python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
        
    `sourcenet` example:
    
        python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
        
More details: [http://ipython.readthedocs.io/en/stable/install/kernel_install.html](http://ipython.readthedocs.io/en/stable/install/kernel_install.html)

## Setup - Initialize Django

- Back to [Table of Contents](#Table-of-Contents)

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.

In [4]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [5]:
%run $django_init_path

django initialized at 2019-08-07 16:42:13.298381


In [6]:
# context_text imports
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase

## Setup - Initialize LoggingHelper

- Back to [Table of Contents](#Table-of-Contents)

Create a LoggingHelper instance to use to log debug and also print at the same time.

Preconditions: Must be run after Django is initialized, since `python_utilities` is in the django path.

In [20]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "proquest_hnp-article-loading-BostonGlobe" )
log_message = None

# Find articles to be loaded

- Back to [Table of Contents](#Table-of-Contents)

Specify which folder of XML files should be loaded into system, then process all files within the folder.

The compressed archives from proquest_hnp just contain publication XML files, no containing folder.

To process:

- **uncompresed paper folder ( `<paper_folder>` )** - make a folder in `/mnt/hgfs/projects/phd/proquest_hnp/uncompressed` for the paper whose data you are working with, named the same as the paper's folder in `/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data`.

    - for example, for the Boston Globe, name it "`BostonGlobe`".

- **uncompressed archive folder ( `<archive_folder>` )** - inside a given paper's folder in uncompressed, for each archive file, create a folder named the same as the archive file, but with no ".zip" at the end.

    - For example, for the file "`BG_20171002210239_00001.zip`", make a folder named "`BG_20171002210239_00001`".
    - path should be "`<paper_folder>/<archive_name_no_zip>`.

- unzip the archive into this folder:

        unzip <path_to_zip> -d <archive_folder>



## Loading setup - working folder paths

- Back to [Table of Contents](#Table-of-Contents)

What data are we looking at?

In [21]:
# paper identifier
paper_identifier = "BostonGlobe"
archive_identifier = None

# source
source_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data"
source_paper_path = "{}/{}".format( source_paper_folder, paper_identifier )
source_archive_file = "{}.zip".format( archive_identifier )
source_archive_path = "{}/{}".format( source_paper_path, source_archive_file )

# uncompressed
uncompressed_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/uncompressed"
uncompressed_paper_path = "{}/{}".format( uncompressed_paper_folder, paper_identifier )

# make sure an identifier is set before you make a path here.
if ( ( archive_identifier is not None ) and ( archive_identifier != "" ) ):
    
    # identifier is set.
    uncompressed_archive_path = "{}/{}".format( uncompressed_paper_path, archive_identifier )

#-- END check to see if archive_identifier present. --#

## Uncompress files

- Back to [Table of Contents](#Table-of-Contents)

See if the uncompressed paper folder exists.  If not, set flag and create it.

In [8]:
# declare variables
did_uncomp_paper_folder_exist = False

# check if uncompressed paper folder exists.
if not os.path.exists( uncompressed_paper_path ):
    
    # no.  Make it.
    os.makedirs( uncompressed_paper_path )
    did_uncomp_paper_folder_exist = False
    log_message = "CREATED - Uncompressed paper folder {}".format( uncompressed_paper_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    
else:
    
    # yes.  Set flag.
    did_uncomp_paper_folder_exist = True
    log_message = "EXISTS - Uncompressed paper folder {}".format( uncompressed_paper_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    
#-- END check to see if paper folder exists.

EXISTS - Uncompressed paper folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe


For each *.zip file in the paper's source folder:

- parse file name from path returned by glob.
- parse the part before ".zip" from the file name.  This is referred to subsequently as the "archive identifier".
- check if folder named the same as the "archive identifier" is present.

    - If no:
    
        - create it.
        - then, uncompress the archive into it.
        
    - If yes:
    
        - output a message.  Don't want to uncompress if it was already uncompressed once.

In [12]:
# declare variables - papers
did_uncomp_paper_folder_exist = None

# declare variables archive (.zip) files.
zip_file_list = None
zip_file_path = None
zip_file_path_parts_list = None
zip_file_name = None
zip_file_name_parts_list = None
archive_identifier = None
uc_archive_folder_path = None
zip_file = None

# declare variables - auditing (uc = uncompressed)
archive_file_counter = None
did_uc_archive_folder_exist = None
uc_folder_exists_counter = None
start_dt = None
end_dt = None
time_delta = None

# use glob to get list of zip files in paper source folder.
zip_file_list = glob.glob( "{}/*.zip".format( source_paper_path ) )
zip_file_count = len( zip_file_list )

log_message = "==> zip file count: {}".format( zip_file_count )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )

# loop over zip files.
archive_file_counter = 0
did_uc_archive_folder_exist = False
uc_folder_exists_counter = 0
for zip_file_path in zip_file_list:
    
    # increment counter
    archive_file_counter += 1
    
    log_message = "----> processing file {} of {}".format( archive_file_counter, zip_file_count )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    
    # get file name
    
    # split path into parts on path separator.
    zip_file_path_parts_list = zip_file_path.split( "/" )

    # file name is the last thing in the list.
    zip_file_name = zip_file_path_parts_list[ -1 ]

    # archive_identifier is name with ".zip" removed from end.
    zip_file_name_parts_list = zip_file_name.split( ".zip" )
    archive_identifier = zip_file_name_parts_list[ 0 ]
    
    # for now, log and print the things we've just created.
    log_message = "==> path: {}".format( zip_file_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    log_message = "==> file: {}".format( zip_file_name )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    log_message = "==> ID: {}".format( archive_identifier )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    # check if uncompressed archive folder exists.
    uc_archive_folder_path = "{}/{}".format( uncompressed_paper_path, archive_identifier )

    log_message = "==> TO: {}".format( uc_archive_folder_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    # check if the uncompressed archive folder exists.
    did_uc_archive_folder_exist = os.path.exists( uc_archive_folder_path )
    if did_uc_archive_folder_exist == False:

        # no.  Make it.
        os.makedirs( uc_archive_folder_path )
        log_message = "CREATED - Uncompressed archive folder {}".format( uc_archive_folder_path )
        my_logging_helper.output_debug_message( log_message, do_print_IN = True )
        
        # and uncompress archive to it.
        with zipfile.ZipFile( zip_file_path, 'r' ) as zip_file:

            # starting extract.
            start_dt = datetime.datetime.now()
            log_message = "==> extract started at {}".format( start_dt )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
            # unzip to uncompressed archive folder path.
            zip_file.extractall( uc_archive_folder_path )
            
            log_message = "EXTRACTED - {}".format( zip_file_path )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )

            log_message = "TO uncompressed archive folder - {}".format( uc_archive_folder_path )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
            # complete
            end_dt = datetime.datetime.now()
            
            log_message = "==> extract completed at {}".format( end_dt )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
            log_message = "==> time elapsed: {}".format( end_dt - start_dt )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
        #-- END with ZipFile --#

    else:

        # yes.  Set flag.
        uc_folder_exists_counter += 1
        log_message = "EXISTS, so moving on - Uncompressed archive folder {}".format( uncompressed_archive_path )
        my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    #-- END check to see if archive folder exists. --#

    log_message = "------------------------------"
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

#-- END loop over zip files. --#

==> zip file count: 474
----> processing file 1 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210212722_00001.zip
==> file: BG_20151210212722_00001.zip
==> ID: BG_20151210212722_00001
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212722_00001
EXISTS, so moving on - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002210239_00001
------------------------------
----> processing file 2 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210212824_00002.zip
==> file: BG_20151210212824_00002.zip
==> ID: BG_20151210212824_00002
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212824_00002
EXISTS, so moving on - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002210239_00001
------------------------------
----> processing file 3 of 474
==> path: /mnt/

==> extract started at 2019-08-06 03:21:58.085016
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210221951_00011.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210221951_00011
==> extract completed at 2019-08-06 03:22:58.860069
==> time elapsed: 0:01:00.775053
------------------------------
----> processing file 63 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210222052_00012.zip
==> file: BG_20151210222052_00012.zip
==> ID: BG_20151210222052_00012
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210222052_00012
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210222052_00012
==> extract started at 2019-08-06 03:22:59.254028
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210222052_00012.zip
TO uncompressed archive folder -

==> extract started at 2019-08-06 03:32:43.432826
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210223309_00009.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210223309_00009
==> extract completed at 2019-08-06 03:33:40.643441
==> time elapsed: 0:00:57.210615
------------------------------
----> processing file 74 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210223310_00010.zip
==> file: BG_20151210223310_00010.zip
==> ID: BG_20151210223310_00010
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210223310_00010
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210223310_00010
==> extract started at 2019-08-06 03:33:40.926361
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210223310_00010.zip
TO uncompressed archive folder -

==> extract started at 2019-08-06 03:44:00.108615
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210224626_00005.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210224626_00005
==> extract completed at 2019-08-06 03:45:01.314292
==> time elapsed: 0:01:01.205677
------------------------------
----> processing file 85 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210224628_00006.zip
==> file: BG_20151210224628_00006.zip
==> ID: BG_20151210224628_00006
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210224628_00006
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210224628_00006
==> extract started at 2019-08-06 03:45:01.672662
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210224628_00006.zip
TO uncompressed archive folder -

==> extract started at 2019-08-06 03:54:46.938229
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210225842_00002.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210225842_00002
==> extract completed at 2019-08-06 03:55:45.204610
==> time elapsed: 0:00:58.266381
------------------------------
----> processing file 96 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210225943_00003.zip
==> file: BG_20151210225943_00003.zip
==> ID: BG_20151210225943_00003
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210225943_00003
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210225943_00003
==> extract started at 2019-08-06 03:55:45.480919
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210225943_00003.zip
TO uncompressed archive folder -

==> extract started at 2019-08-06 04:05:42.513117
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210230656_00013.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210230656_00013
==> extract completed at 2019-08-06 04:06:21.739889
==> time elapsed: 0:00:39.226772
------------------------------
----> processing file 107 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210231201_00001.zip
==> file: BG_20151210231201_00001.zip
==> ID: BG_20151210231201_00001
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210231201_00001
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210231201_00001
==> extract started at 2019-08-06 04:06:22.078363
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210231201_00001.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 04:17:18.907102
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210232015_00011.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210232015_00011
==> extract completed at 2019-08-06 04:17:55.048715
==> time elapsed: 0:00:36.141613
------------------------------
----> processing file 118 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210232721_00001.zip
==> file: BG_20151210232721_00001.zip
==> ID: BG_20151210232721_00001
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210232721_00001
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210232721_00001
==> extract started at 2019-08-06 04:17:55.478839
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210232721_00001.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 04:27:36.555522
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210234137_00001.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210234137_00001
==> extract completed at 2019-08-06 04:28:42.563467
==> time elapsed: 0:01:06.007945
------------------------------
----> processing file 129 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210234239_00002.zip
==> file: BG_20151210234239_00002.zip
==> ID: BG_20151210234239_00002
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210234239_00002
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210234239_00002
==> extract started at 2019-08-06 04:28:43.003781
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210234239_00002.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 04:38:49.378921
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210235855_00003.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210235855_00003
==> extract completed at 2019-08-06 04:39:49.783355
==> time elapsed: 0:01:00.404434
------------------------------
----> processing file 140 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210235956_00004.zip
==> file: BG_20151210235956_00004.zip
==> ID: BG_20151210235956_00004
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210235956_00004
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210235956_00004
==> extract started at 2019-08-06 04:39:50.089227
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151210235956_00004.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 04:49:26.452859
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211001416_00004.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211001416_00004
==> extract completed at 2019-08-06 04:50:23.184877
==> time elapsed: 0:00:56.732018
------------------------------
----> processing file 151 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211001417_00005.zip
==> file: BG_20151211001417_00005.zip
==> ID: BG_20151211001417_00005
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211001417_00005
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211001417_00005
==> extract started at 2019-08-06 04:50:23.466931
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211001417_00005.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 04:59:16.520006
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211003138_00007.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211003138_00007
==> extract completed at 2019-08-06 05:00:14.946059
==> time elapsed: 0:00:58.426053
------------------------------
----> processing file 162 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211003139_00008.zip
==> file: BG_20151211003139_00008.zip
==> ID: BG_20151211003139_00008
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211003139_00008
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211003139_00008
==> extract started at 2019-08-06 05:00:15.253028
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211003139_00008.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 05:09:44.703851
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211004856_00009.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211004856_00009
==> extract completed at 2019-08-06 05:10:40.179548
==> time elapsed: 0:00:55.475697
------------------------------
----> processing file 173 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211004957_00010.zip
==> file: BG_20151211004957_00010.zip
==> ID: BG_20151211004957_00010
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211004957_00010
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211004957_00010
==> extract started at 2019-08-06 05:10:40.419398
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211004957_00010.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 05:19:32.043308
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211010313_00010.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211010313_00010
==> extract completed at 2019-08-06 05:20:25.655858
==> time elapsed: 0:00:53.612550
------------------------------
----> processing file 184 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211010414_00011.zip
==> file: BG_20151211010414_00011.zip
==> ID: BG_20151211010414_00011
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211010414_00011
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211010414_00011
==> extract started at 2019-08-06 05:20:25.726203
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211010414_00011.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 05:28:34.451846
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211011829_00010.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211011829_00010
==> extract completed at 2019-08-06 05:29:27.504534
==> time elapsed: 0:00:53.052688
------------------------------
----> processing file 195 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211011930_00011.zip
==> file: BG_20151211011930_00011.zip
==> ID: BG_20151211011930_00011
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211011930_00011
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211011930_00011
==> extract started at 2019-08-06 05:29:27.653684
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211011930_00011.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 05:37:55.085587
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211013345_00010.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211013345_00010
==> extract completed at 2019-08-06 05:38:48.705622
==> time elapsed: 0:00:53.620035
------------------------------
----> processing file 206 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211013346_00011.zip
==> file: BG_20151211013346_00011.zip
==> ID: BG_20151211013346_00011
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211013346_00011
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211013346_00011
==> extract started at 2019-08-06 05:38:49.067795
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211013346_00011.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 05:47:56.031162
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211014802_00010.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211014802_00010
==> extract completed at 2019-08-06 05:48:49.950296
==> time elapsed: 0:00:53.919134
------------------------------
----> processing file 217 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211014903_00011.zip
==> file: BG_20151211014903_00011.zip
==> ID: BG_20151211014903_00011
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211014903_00011
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211014903_00011
==> extract started at 2019-08-06 05:48:50.398155
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211014903_00011.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 05:58:05.989140
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211020218_00009.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211020218_00009
==> extract completed at 2019-08-06 05:59:02.746331
==> time elapsed: 0:00:56.757191
------------------------------
----> processing file 228 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211020319_00010.zip
==> file: BG_20151211020319_00010.zip
==> ID: BG_20151211020319_00010
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211020319_00010
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211020319_00010
==> extract started at 2019-08-06 05:59:03.011765
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211020319_00010.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 06:07:37.981795
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211021634_00007.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211021634_00007
==> extract completed at 2019-08-06 06:08:32.321821
==> time elapsed: 0:00:54.340026
------------------------------
----> processing file 239 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211021636_00008.zip
==> file: BG_20151211021636_00008.zip
==> ID: BG_20151211021636_00008
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211021636_00008
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211021636_00008
==> extract started at 2019-08-06 06:08:32.707234
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211021636_00008.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 06:16:41.564565
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211022950_00005.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211022950_00005
==> extract completed at 2019-08-06 06:17:38.139203
==> time elapsed: 0:00:56.574638
------------------------------
----> processing file 250 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211023053_00006.zip
==> file: BG_20151211023053_00006.zip
==> ID: BG_20151211023053_00006
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211023053_00006
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211023053_00006
==> extract started at 2019-08-06 06:17:38.513269
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211023053_00006.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 06:26:42.063813
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211024407_00004.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211024407_00004
==> extract completed at 2019-08-06 06:28:33.441304
==> time elapsed: 0:01:51.377491
------------------------------
----> processing file 261 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211024510_00005.zip
==> file: BG_20151211024510_00005.zip
==> ID: BG_20151211024510_00005
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211024510_00005
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211024510_00005
==> extract started at 2019-08-06 06:28:33.855377
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211024510_00005.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 07:11:55.007046
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211025824_00002.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211025824_00002
==> extract completed at 2019-08-06 07:12:56.487321
==> time elapsed: 0:01:01.480275
------------------------------
----> processing file 272 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211025925_00003.zip
==> file: BG_20151211025925_00003.zip
==> ID: BG_20151211025925_00003
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211025925_00003
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211025925_00003
==> extract started at 2019-08-06 07:12:56.987444
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211025925_00003.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 07:23:35.700769
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211030538_00013.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211030538_00013
==> extract completed at 2019-08-06 07:24:43.518119
==> time elapsed: 0:01:07.817350
------------------------------
----> processing file 283 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211030639_00014.zip
==> file: BG_20151211030639_00014.zip
==> ID: BG_20151211030639_00014
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211030639_00014
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211030639_00014
==> extract started at 2019-08-06 07:24:43.977243
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211030639_00014.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 07:34:11.432844
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211031955_00009.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211031955_00009
==> extract completed at 2019-08-06 07:35:08.003095
==> time elapsed: 0:00:56.570251
------------------------------
----> processing file 294 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211032056_00010.zip
==> file: BG_20151211032056_00010.zip
==> ID: BG_20151211032056_00010
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211032056_00010
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211032056_00010
==> extract started at 2019-08-06 07:35:08.466296
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211032056_00010.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 07:43:56.167460
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211033213_00006.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211033213_00006
==> extract completed at 2019-08-06 07:44:50.645822
==> time elapsed: 0:00:54.478362
------------------------------
----> processing file 305 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211033314_00007.zip
==> file: BG_20151211033314_00007.zip
==> ID: BG_20151211033314_00007
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211033314_00007
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211033314_00007
==> extract started at 2019-08-06 07:44:50.916644
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211033314_00007.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 07:54:25.276003
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211034528_00003.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211034528_00003
==> extract completed at 2019-08-06 07:55:22.508901
==> time elapsed: 0:00:57.232898
------------------------------
----> processing file 316 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211034629_00004.zip
==> file: BG_20151211034629_00004.zip
==> ID: BG_20151211034629_00004
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211034629_00004
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211034629_00004
==> extract started at 2019-08-06 07:55:22.784215
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211034629_00004.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 08:04:44.063709
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211035239_00014.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211035239_00014
==> extract completed at 2019-08-06 08:05:39.377843
==> time elapsed: 0:00:55.314134
------------------------------
----> processing file 327 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211035240_00015.zip
==> file: BG_20151211035240_00015.zip
==> ID: BG_20151211035240_00015
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211035240_00015
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211035240_00015
==> extract started at 2019-08-06 08:05:39.690906
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211035240_00015.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 08:14:42.911722
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211040355_00009.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211040355_00009
==> extract completed at 2019-08-06 08:15:38.839922
==> time elapsed: 0:00:55.928200
------------------------------
----> processing file 338 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211040456_00010.zip
==> file: BG_20151211040456_00010.zip
==> ID: BG_20151211040456_00010
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211040456_00010
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211040456_00010
==> extract started at 2019-08-06 08:15:39.174938
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211040456_00010.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 08:23:57.216902
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211041610_00002.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211041610_00002
==> extract completed at 2019-08-06 08:24:49.027265
==> time elapsed: 0:00:51.810363
------------------------------
----> processing file 349 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211041711_00003.zip
==> file: BG_20151211041711_00003.zip
==> ID: BG_20151211041711_00003
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211041711_00003
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211041711_00003
==> extract started at 2019-08-06 08:24:49.358015
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211041711_00003.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 08:33:44.534428
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211042322_00013.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211042322_00013
==> extract completed at 2019-08-06 08:34:38.040500
==> time elapsed: 0:00:53.506072
------------------------------
----> processing file 360 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211042424_00014.zip
==> file: BG_20151211042424_00014.zip
==> ID: BG_20151211042424_00014
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211042424_00014
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211042424_00014
==> extract started at 2019-08-06 08:34:38.360143
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211042424_00014.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 08:43:32.325278
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211043437_00006.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211043437_00006
==> extract completed at 2019-08-06 08:44:26.694446
==> time elapsed: 0:00:54.369168
------------------------------
----> processing file 371 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211043438_00007.zip
==> file: BG_20151211043438_00007.zip
==> ID: BG_20151211043438_00007
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211043438_00007
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211043438_00007
==> extract started at 2019-08-06 08:44:27.036482
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211043438_00007.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 08:54:14.331144
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211044149_00017.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211044149_00017
==> extract completed at 2019-08-06 08:55:12.010923
==> time elapsed: 0:00:57.679779
------------------------------
----> processing file 382 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211044251_00018.zip
==> file: BG_20151211044251_00018.zip
==> ID: BG_20151211044251_00018
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211044251_00018
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211044251_00018
==> extract started at 2019-08-06 08:55:12.266129
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211044251_00018.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 09:04:26.654821
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211045103_00010.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211045103_00010
==> extract completed at 2019-08-06 09:05:23.021547
==> time elapsed: 0:00:56.366726
------------------------------
----> processing file 393 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211045104_00011.zip
==> file: BG_20151211045104_00011.zip
==> ID: BG_20151211045104_00011
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211045104_00011
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211045104_00011
==> extract started at 2019-08-06 09:05:23.377645
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211045104_00011.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 09:14:17.418238
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211050118_00003.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050118_00003
==> extract completed at 2019-08-06 09:15:10.498721
==> time elapsed: 0:00:53.080483
------------------------------
----> processing file 404 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211050219_00004.zip
==> file: BG_20151211050219_00004.zip
==> ID: BG_20151211050219_00004
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050219_00004
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050219_00004
==> extract started at 2019-08-06 09:15:10.769316
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211050219_00004.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 09:24:16.961574
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211050730_00014.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050730_00014
==> extract completed at 2019-08-06 09:25:13.827045
==> time elapsed: 0:00:56.865471
------------------------------
----> processing file 415 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211050832_00015.zip
==> file: BG_20151211050832_00015.zip
==> ID: BG_20151211050832_00015
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050832_00015
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050832_00015
==> extract started at 2019-08-06 09:25:14.106978
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211050832_00015.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 09:33:44.689473
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211051749_00004.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211051749_00004
==> extract completed at 2019-08-06 09:34:40.356447
==> time elapsed: 0:00:55.666974
------------------------------
----> processing file 426 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211051750_00005.zip
==> file: BG_20151211051750_00005.zip
==> ID: BG_20151211051750_00005
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211051750_00005
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211051750_00005
==> extract started at 2019-08-06 09:34:40.623220
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211051750_00005.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 09:43:48.728881
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211052301_00015.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211052301_00015
==> extract completed at 2019-08-06 09:44:43.363970
==> time elapsed: 0:00:54.635089
------------------------------
----> processing file 437 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211052302_00016.zip
==> file: BG_20151211052302_00016.zip
==> ID: BG_20151211052302_00016
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211052302_00016
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211052302_00016
==> extract started at 2019-08-06 09:44:43.633727
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211052302_00016.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 09:54:44.374249
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211053321_00008.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211053321_00008
==> extract completed at 2019-08-06 09:55:45.518693
==> time elapsed: 0:01:01.144444
------------------------------
----> processing file 448 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211053423_00009.zip
==> file: BG_20151211053423_00009.zip
==> ID: BG_20151211053423_00009
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211053423_00009
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211053423_00009
==> extract started at 2019-08-06 09:55:45.945916
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211053423_00009.zip
TO uncompressed archive folder 

==> extract started at 2019-08-06 10:04:49.972884
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211054235_00003.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211054235_00003
==> extract completed at 2019-08-06 10:05:46.223767
==> time elapsed: 0:00:56.250883
------------------------------
----> processing file 459 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211054236_00004.zip
==> file: BG_20151211054236_00004.zip
==> ID: BG_20151211054236_00004
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211054236_00004
EXISTS, so moving on - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002210239_00001
------------------------------
----> processing file 460 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20151211054338_00005.zip
==> fi

==> extract started at 2019-08-06 10:12:07.919337
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20171002204834_00003.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002204834_00003
==> extract completed at 2019-08-06 10:13:04.630173
==> time elapsed: 0:00:56.710836
------------------------------
----> processing file 470 of 474
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20171002204837_00004.zip
==> file: BG_20171002204837_00004.zip
==> ID: BG_20171002204837_00004
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002204837_00004
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002204837_00004
==> extract started at 2019-08-06 10:13:04.931850
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20171002204837_00004.zip
TO uncompressed archive folder 

## Work with uncompressed files

- Back to [Table of Contents](#Table-of-Contents)

Change working directories to the uncompressed paper path.

In [8]:
%cd $uncompressed_paper_path

/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe


In [9]:
%ls

[0m[01;34mBG_20151210212722_00001[0m/  [01;34mBG_20151211002934_00005[0m/  [01;34mBG_20151211034630_00005[0m/
[01;34mBG_20151210212824_00002[0m/  [01;34mBG_20151211003037_00006[0m/  [01;34mBG_20151211034731_00006[0m/
[01;34mBG_20151210212825_00003[0m/  [01;34mBG_20151211003138_00007[0m/  [01;34mBG_20151211034732_00007[0m/
[01;34mBG_20151210212926_00004[0m/  [01;34mBG_20151211003139_00008[0m/  [01;34mBG_20151211034833_00008[0m/
[01;34mBG_20151210212927_00005[0m/  [01;34mBG_20151211003240_00009[0m/  [01;34mBG_20151211034934_00009[0m/
[01;34mBG_20151210213028_00006[0m/  [01;34mBG_20151211004245_00001[0m/  [01;34mBG_20151211034935_00010[0m/
[01;34mBG_20151210213031_00007[0m/  [01;34mBG_20151211004347_00002[0m/  [01;34mBG_20151211035036_00011[0m/
[01;34mBG_20151210213132_00008[0m/  [01;34mBG_20151211004448_00003[0m/  [01;34mBG_20151211035037_00012[0m/
[01;34mBG_20151210213134_00009[0m/  [01;34mBG_20151211004549_00004[0m/  [01;34m

## load a file into memory.

- Back to [Table of Contents](#Table-of-Contents)

Load one of the files into memory and see what we can do with it.  Beautiful Soup?

Looks like the root element is "Record", then the high-level type of the article is "ObjectType".

ObjectType values:

- Advertisement
- ...

Good options for XML parser:

- `lxml.etree` - [https://stackoverflow.com/questions/12290091/reading-xml-file-and-fetching-its-attributes-value-in-python](https://stackoverflow.com/questions/12290091/reading-xml-file-and-fetching-its-attributes-value-in-python)
- `xmltodict` - [https://docs.python-guide.org/scenarios/xml/](https://docs.python-guide.org/scenarios/xml/)
- `beautifulsoup` using `lxml`

In [22]:
# loop over files in the current archive folder path.

# declare variables
xml_file_list = None
xml_file_path = None
xml_file = None
xml_dict = None
xml_file_counter = None
object_type_to_count_map = None
object_type_count = None
record_node = None
object_type_node = None
object_type_list = None
object_type = None

# declare variables - auditing
xml_file_counter = None
no_record_counter = None
no_object_type_counter = None
no_object_type_text_counter = None

# init
object_type_to_count_map = {}

# get file list.
xml_file_list = glob.glob( "{}/*.xml".format( uncompressed_archive_path ) )

# loop
xml_file_counter = 0
no_record_counter = 0
no_object_type_counter = 0
no_object_type_text_counter = 0
for xml_file_path in xml_file_list:
    
    xml_file_counter += 1
    
    # try to parse the file
    with open( xml_file_path ) as xml_file:
    
        # parse XML
        xml_dict = xmltodict.parse( xml_file.read() )
        
        # get root.Record.ObjectType value
        record_node = xml_dict.get( "Record", None )
        
        if ( record_node is not None ):
            
            # get object type (looks like xmltodict stores
            #     elements with no attributes and no child
            #     elements just as a string contents mapped
            #     to name in parent, no dictionary)
            #
            # so for:
            # <Record>
            #     ...
            #     <ObjectType>Advertisement</ObjectType>
            #     ...
            # </Record>
            #
            # to get value:
            #     record_node = xml_dict.get( "Record", None )
            #     object_type_list = record_node.get( "ObjectType", None )
            #     object_type = "|".join( object_type_list )
            #
            # NOT:
            #     record_node = xml_dict.get( "Record", None )
            #     object_type_node = record_node.get( "ObjectType", None )
            #     object_type = object_type_node.get( "#text", None )
            #
            # Doc that led me astray: https://docs.python-guide.org/scenarios/xml/
            object_type_list = record_node.get( "ObjectType", None )
            object_type = "|".join( object_type_list )

            # got a type?
            if ( ( object_type is not None ) and ( object_type != "" ) ):

                # we do.  Increment count.
                object_type_count = object_type_to_count_map.get( object_type, 0 )
                object_type_count += 1
                object_type_to_count_map[ object_type ] = object_type_count

            else:

                # object type is None
                no_object_type_text_counter += 1

            #-- END check for type value --#
            
        else:
            
            # increment counter
            no_record_counter += 1
            
        #-- END check if we found a "Record" node in root --#

    #-- END with open( xml_file_path )...: --#
    
#-- END loop over XML files --#

print( "XML file count: {}".format( len( xml_file_list ) ) )
print( "Counters:" )
print( "- Processed {} files".format( xml_file_counter ) )
print( "- No Record: {}".format( no_record_counter ) )
print( "- No ObjectType: {}".format( no_object_type_counter ) )
print( "- No ObjectType value: {}".format( no_object_type_text_counter ) )
print( "\nObjectType values and occurrence counts:")
for object_type, object_type_count in six.iteritems( object_type_to_count_map ):
    
    # print type and count
    print( "- {}: {}".format( object_type, object_type_count ) )
    
#-- END loop over object types. --#

XML file count: 5752
Counters:
- Processed 5752 files
- No Record: 0
- No ObjectType: 0
- No ObjectType value: 0

ObjectType values and occurrence counts:
- A|d|v|e|r|t|i|s|e|m|e|n|t: 1902
- Article|Feature: 1792
- N|e|w|s: 53
- Commentary|Editorial: 36
- G|e|n|e|r|a|l| |I|n|f|o|r|m|a|t|i|o|n: 488
- S|t|o|c|k| |Q|u|o|t|e: 185
- Advertisement|Classified Advertisement: 413
- E|d|i|t|o|r|i|a|l| |C|a|r|t|o|o|n|/|C|o|m|i|c: 31
- Correspondence|Letter to the Editor: 119
- Front Matter|Table of Contents: 193
- O|b|i|t|u|a|r|y: 72
- F|r|o|n|t| |P|a|g|e|/|C|o|v|e|r| |S|t|o|r|y: 107
- I|m|a|g|e|/|P|h|o|t|o|g|r|a|p|h: 84
- Marriage Announcement|News: 6
- I|l|l|u|s|t|r|a|t|i|o|n: 91
- R|e|v|i|e|w: 133
- C|r|e|d|i|t|/|A|c|k|n|o|w|l|e|d|g|e|m|e|n|t: 30
- News|Legal Notice: 17


## build list of all ObjectTypes

- Back to [Table of Contents](#Table-of-Contents)

Loop over all folders in the paper path.  For each folder, grab all files in the folder.  For each file, parse XML, then get the ObjectType value and if it isn't already in map of obect types to counts, add it.  Increment count.

From command line, in the uncompressed BostonGlobe folder:

    find . -type f -iname "*.xml" | wc -l

resulted in 11,374,500 articles.  That is quite a few.

In [18]:
xml_folder_list = glob.glob( "{}/*".format( uncompressed_paper_path ) )
print( "folder_list: {}".format( xxml_folder_list ) )

folder_list: ['/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212722_00001', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212824_00002', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212825_00003', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212926_00004', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212927_00005', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210213028_00006', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210213031_00007', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210213132_00008', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210213134_00009', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210213235_00010', '/mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210213236_00011', '/mnt/hgfs/project

In [27]:
# declare variables
xml_file_list = None
xml_file_path = None
xml_file = None
xml_dict = None
xml_file_counter = None
object_type_to_count_map = None
object_type_count = None
record_node = None
object_type_node = None
object_type_list = None
object_type = None

# declare variables - folders
xml_folder_list = None
xml_folder_path = None
xml_folder_count = None
xml_folder_counter = None
xml_folder_start_time = None
xml_folder_end_time = None
xml_folder_duration = None

# declare variables - auditing
xml_file_count = None
xml_file_counter = None
no_record_counter = None
no_object_type_counter = None
no_object_type_text_counter = None

# init
object_type_to_count_map = {}

# first, get the list of folders for the current paper.
xml_folder_list = glob.glob( "{}/*".format( uncompressed_paper_path ) )
xml_folder_count = len( xml_folder_list )

log_message = "Processing {} XML folders in {}".format( xml_folder_count, uncompressed_paper_path )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )

# loop over the folders
xml_folder_counter = 0
for xml_folder_path in xml_folder_list:
    
    xml_folder_counter += 1
    
    # log the folder
    xml_folder_start_time = datetime.datetime.now()
    log_message = "==> Processing XML folder {} ( {} of {} ) @ {}".format( xml_folder_path, xml_folder_counter, xml_folder_count, xml_folder_start_time )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    # get file list.
    xml_file_list = glob.glob( "{}/*.xml".format( xml_folder_path ) )
    xml_file_count = len( xml_file_list )

    # log the count
    log_message = "----> XML file count: {}".format( xml_file_count )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    # loop
    xml_file_counter = 0
    no_record_counter = 0
    no_object_type_counter = 0
    no_object_type_text_counter = 0
    for xml_file_path in xml_file_list:

        xml_file_counter += 1

        # try to parse the file
        with open( xml_file_path ) as xml_file:

            # parse XML
            xml_dict = xmltodict.parse( xml_file.read() )

            # get root.Record.ObjectType value
            record_node = xml_dict.get( "Record", None )

            if ( record_node is not None ):

                # get object type (looks like xmltodict stores
                #     elements with no attributes and no child
                #     elements just as a string contents mapped
                #     to name in parent, no dictionary)
                #
                # so for:
                # <Record>
                #     ...
                #     <ObjectType>Advertisement</ObjectType>
                #     ...
                # </Record>
                #
                # to get value:
                #     record_node = xml_dict.get( "Record", None )
                #     object_type_list = record_node.get( "ObjectType", None )
                #     object_type = "|".join( object_type_list )
                #
                # NOT:
                #     record_node = xml_dict.get( "Record", None )
                #     object_type_node = record_node.get( "ObjectType", None )
                #     object_type = object_type_node.get( "#text", None )
                #
                # Doc that led me astray: https://docs.python-guide.org/scenarios/xml/
                object_type_list = record_node.get( "ObjectType", None )
                object_type = "|".join( object_type_list )

                # got a type?
                if ( ( object_type is not None ) and ( object_type != "" ) ):

                    # we do.  Increment count.
                    object_type_count = object_type_to_count_map.get( object_type, 0 )
                    object_type_count += 1
                    object_type_to_count_map[ object_type ] = object_type_count

                else:

                    # object type is None
                    no_object_type_text_counter += 1

                #-- END check for type value --#

            else:

                # increment counter
                no_record_counter += 1

            #-- END check if we found a "Record" node in root --#

        #-- END with open( xml_file_path )...: --#

    #-- END loop over XML files --#
    
    # log the folder
    xml_folder_end_time = datetime.datetime.now()
    xml_folder_duration = xml_folder_end_time - xml_folder_start_time
    log_message = "----> Processing complete @ {} ( duration {} )\n".format( xml_folder_end_time, xml_folder_duration )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

#-- END loop over XML directories --#

log_message = "XML file count: {}".format( len( xml_file_list ) )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )
log_message = "Counters:"
my_logging_helper.output_debug_message( log_message, do_print_IN = True )
log_message = "- Processed {} files".format( xml_file_counter )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )
log_message = "- No Record: {}".format( no_record_counter )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )
log_message = "- No ObjectType: {}".format( no_object_type_counter )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )
log_message = "- No ObjectType value: {}".format( no_object_type_text_counter )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )
log_message = "\nObjectType values and occurrence counts:"
my_logging_helper.output_debug_message( log_message, do_print_IN = True )

for object_type, object_type_count in six.iteritems( object_type_to_count_map ):
    
    # print type and count
    log_message = "- {}: {}".format( object_type, object_type_count )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

#-- END loop over object types. --#

Processing 474 XML folders in /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe
==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212722_00001 ( 1 of 474 ) @ 2019-08-07 19:31:28.237989
----> XML file count: 25000
----> Processing complete @ 2019-08-07 19:32:48.475467 ( duration 0:01:20.237478 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212824_00002 ( 2 of 474 ) @ 2019-08-07 19:32:48.476118
----> XML file count: 25000
----> Processing complete @ 2019-08-07 19:34:06.237756 ( duration 0:01:17.761638 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212825_00003 ( 3 of 474 ) @ 2019-08-07 19:34:06.238444
----> XML file count: 25000
----> Processing complete @ 2019-08-07 19:35:25.656844 ( duration 0:01:19.418400 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210212926_00004 ( 4

----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:13:06.568319 ( duration 0:01:25.933434 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210215105_00016 ( 32 of 474 ) @ 2019-08-07 20:13:06.568507
----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:14:31.648745 ( duration 0:01:25.080238 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210215106_00017 ( 33 of 474 ) @ 2019-08-07 20:14:31.649310
----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:15:56.461760 ( duration 0:01:24.812450 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210215207_00018 ( 34 of 474 ) @ 2019-08-07 20:15:56.462260
----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:17:22.193639 ( duration 0:01:25.731379 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_

----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:55:40.803182 ( duration 0:01:18.644045 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210222052_00012 ( 63 of 474 ) @ 2019-08-07 20:55:40.803710
----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:57:01.992274 ( duration 0:01:21.188564 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210222053_00013 ( 64 of 474 ) @ 2019-08-07 20:57:01.992492
----> XML file count: 20604
----> Processing complete @ 2019-08-07 20:58:08.665039 ( duration 0:01:06.672547 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210222758_00001 ( 65 of 474 ) @ 2019-08-07 20:58:08.665582
----> XML file count: 25000
----> Processing complete @ 2019-08-07 20:59:33.085549 ( duration 0:01:24.419967 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_

----> XML file count: 16282
----> Processing complete @ 2019-08-07 21:36:04.485469 ( duration 0:00:51.394807 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210225741_00001 ( 94 of 474 ) @ 2019-08-07 21:36:04.485636
----> XML file count: 25000
----> Processing complete @ 2019-08-07 21:37:24.402955 ( duration 0:01:19.917319 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210225842_00002 ( 95 of 474 ) @ 2019-08-07 21:37:24.403147
----> XML file count: 25000
----> Processing complete @ 2019-08-07 21:38:43.681139 ( duration 0:01:19.277992 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210225943_00003 ( 96 of 474 ) @ 2019-08-07 21:38:43.681321
----> XML file count: 25000
----> Processing complete @ 2019-08-07 21:40:03.260697 ( duration 0:01:19.579376 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_

----> XML file count: 25000
----> Processing complete @ 2019-08-07 22:17:54.982399 ( duration 0:01:32.245557 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210233331_00008 ( 125 of 474 ) @ 2019-08-07 22:17:54.982605
----> XML file count: 25000
----> Processing complete @ 2019-08-07 22:19:16.335120 ( duration 0:01:21.352515 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210233332_00009 ( 126 of 474 ) @ 2019-08-07 22:19:16.335309
----> XML file count: 25000
----> Processing complete @ 2019-08-07 22:20:38.481172 ( duration 0:01:22.145863 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151210233433_00010 ( 127 of 474 ) @ 2019-08-07 22:20:38.481363
----> XML file count: 19912
----> Processing complete @ 2019-08-07 22:21:44.383969 ( duration 0:01:05.902606 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-07 22:57:44.792451 ( duration 0:01:20.208132 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211002730_00002 ( 156 of 474 ) @ 2019-08-07 22:57:44.792950
----> XML file count: 25000
----> Processing complete @ 2019-08-07 22:59:06.839270 ( duration 0:01:22.046320 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211002832_00003 ( 157 of 474 ) @ 2019-08-07 22:59:06.839466
----> XML file count: 25000
----> Processing complete @ 2019-08-07 23:00:29.808545 ( duration 0:01:22.969079 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211002933_00004 ( 158 of 474 ) @ 2019-08-07 23:00:29.808738
----> XML file count: 25000
----> Processing complete @ 2019-08-07 23:01:55.981316 ( duration 0:01:26.172578 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-07 23:40:08.336464 ( duration 0:01:19.975358 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211011320_00003 ( 187 of 474 ) @ 2019-08-07 23:40:08.336656
----> XML file count: 25000
----> Processing complete @ 2019-08-07 23:41:34.463049 ( duration 0:01:26.126393 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211011421_00004 ( 188 of 474 ) @ 2019-08-07 23:41:34.463235
----> XML file count: 25000
----> Processing complete @ 2019-08-07 23:42:55.295750 ( duration 0:01:20.832515 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211011522_00005 ( 189 of 474 ) @ 2019-08-07 23:42:55.295934
----> XML file count: 25000
----> Processing complete @ 2019-08-07 23:44:15.517899 ( duration 0:01:20.221965 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 00:23:13.864280 ( duration 0:01:25.589361 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211015004_00012 ( 218 of 474 ) @ 2019-08-08 00:23:13.864975
----> XML file count: 1136
----> Processing complete @ 2019-08-08 00:23:17.602367 ( duration 0:00:03.737392 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211015708_00001 ( 219 of 474 ) @ 2019-08-08 00:23:17.602564
----> XML file count: 25000
----> Processing complete @ 2019-08-08 00:24:43.105655 ( duration 0:01:25.503091 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211015809_00002 ( 220 of 474 ) @ 2019-08-08 00:24:43.106321
----> XML file count: 25000
----> Processing complete @ 2019-08-08 00:26:08.596691 ( duration 0:01:25.490370 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/B

----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:04:28.896457 ( duration 0:01:41.420958 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211022950_00005 ( 249 of 474 ) @ 2019-08-08 01:04:28.896645
----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:06:11.483251 ( duration 0:01:42.586606 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211023053_00006 ( 250 of 474 ) @ 2019-08-08 01:06:11.483427
----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:07:55.285942 ( duration 0:01:43.802515 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211023154_00007 ( 251 of 474 ) @ 2019-08-08 01:07:55.286141
----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:09:35.585124 ( duration 0:01:40.298983 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:48:47.599889 ( duration 0:01:24.681949 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211030436_00011 ( 280 of 474 ) @ 2019-08-08 01:48:47.600050
----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:50:12.073888 ( duration 0:01:24.473838 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211030537_00012 ( 281 of 474 ) @ 2019-08-08 01:50:12.074419
----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:51:37.176820 ( duration 0:01:25.102401 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211030538_00013 ( 282 of 474 ) @ 2019-08-08 01:51:37.176982
----> XML file count: 25000
----> Processing complete @ 2019-08-08 01:52:59.634946 ( duration 0:01:22.457964 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 02:33:02.213743 ( duration 0:01:23.534684 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211033820_00013 ( 311 of 474 ) @ 2019-08-08 02:33:02.213932
----> XML file count: 25000
----> Processing complete @ 2019-08-08 02:34:30.081790 ( duration 0:01:27.867858 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211033821_00014 ( 312 of 474 ) @ 2019-08-08 02:34:30.081989
----> XML file count: 18448
----> Processing complete @ 2019-08-08 02:35:33.858076 ( duration 0:01:03.776087 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211034324_00001 ( 313 of 474 ) @ 2019-08-08 02:35:33.858265
----> XML file count: 25000
----> Processing complete @ 2019-08-08 02:37:00.577953 ( duration 0:01:26.719688 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 03:21:08.291530 ( duration 0:01:31.980801 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211040659_00013 ( 342 of 474 ) @ 2019-08-08 03:21:08.292075
----> XML file count: 25000
----> Processing complete @ 2019-08-08 03:22:43.906595 ( duration 0:01:35.614520 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211040701_00015 ( 343 of 474 ) @ 2019-08-08 03:22:43.906784
----> XML file count: 25000
----> Processing complete @ 2019-08-08 03:24:15.583837 ( duration 0:01:31.677053 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211040702_00016 ( 344 of 474 ) @ 2019-08-08 03:24:15.584019
----> XML file count: 25000
----> Processing complete @ 2019-08-08 03:25:43.868842 ( duration 0:01:28.284823 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 04:06:58.255881 ( duration 0:01:36.276260 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211043640_00009 ( 373 of 474 ) @ 2019-08-08 04:06:58.256083
----> XML file count: 25000
----> Processing complete @ 2019-08-08 04:08:35.332386 ( duration 0:01:37.076303 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211043641_00010 ( 374 of 474 ) @ 2019-08-08 04:08:35.332888
----> XML file count: 25000
----> Processing complete @ 2019-08-08 04:10:29.554060 ( duration 0:01:54.221172 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211043742_00011 ( 375 of 474 ) @ 2019-08-08 04:10:29.554280
----> XML file count: 25000
----> Processing complete @ 2019-08-08 04:13:20.184266 ( duration 0:02:50.629986 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 04:59:52.754247 ( duration 0:01:30.048474 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050219_00004 ( 404 of 474 ) @ 2019-08-08 04:59:52.754427
----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:01:26.592419 ( duration 0:01:33.837992 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050220_00005 ( 405 of 474 ) @ 2019-08-08 05:01:26.592611
----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:03:00.360730 ( duration 0:01:33.768119 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211050321_00006 ( 406 of 474 ) @ 2019-08-08 05:03:00.361292
----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:04:40.058206 ( duration 0:01:39.696914 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:48:30.878792 ( duration 0:01:29.726732 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211052259_00014 ( 435 of 474 ) @ 2019-08-08 05:48:30.879358
----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:50:00.312414 ( duration 0:01:29.433056 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211052301_00015 ( 436 of 474 ) @ 2019-08-08 05:50:00.312985
----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:51:27.012364 ( duration 0:01:26.699379 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211052302_00016 ( 437 of 474 ) @ 2019-08-08 05:51:27.012538
----> XML file count: 25000
----> Processing complete @ 2019-08-08 05:52:51.900639 ( duration 0:01:24.888101 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/

----> XML file count: 21039
----> Processing complete @ 2019-08-08 06:38:07.442924 ( duration 0:01:11.680659 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20151211061003_00001 ( 466 of 474 ) @ 2019-08-08 06:38:07.443444
----> XML file count: 993
----> Processing complete @ 2019-08-08 06:38:11.298469 ( duration 0:00:03.855025 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002204731_00001 ( 467 of 474 ) @ 2019-08-08 06:38:11.298632
----> XML file count: 25000
----> Processing complete @ 2019-08-08 06:39:50.790955 ( duration 0:01:39.492323 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG_20171002204732_00002 ( 468 of 474 ) @ 2019-08-08 06:39:50.791116
----> XML file count: 25000
----> Processing complete @ 2019-08-08 06:41:39.551712 ( duration 0:01:48.760596 )

==> Processing XML folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/BostonGlobe/BG

XML file count: 5752
Counters:
- Processed 5752 files
- No Record: 0
- No ObjectType: 0
- No ObjectType value: 0

ObjectType values and occurrence counts:
- A|d|v|e|r|t|i|s|e|m|e|n|t: 2114224
- Feature|Article: 5271887
- I|m|a|g|e|/|P|h|o|t|o|g|r|a|p|h: 249942
- O|b|i|t|u|a|r|y: 625143
- G|e|n|e|r|a|l| |I|n|f|o|r|m|a|t|i|o|n: 1083164
- S|t|o|c|k| |Q|u|o|t|e: 202776
- N|e|w|s: 140274
- I|l|l|u|s|t|r|a|t|i|o|n: 106925
- F|r|o|n|t| |P|a|g|e|/|C|o|v|e|r| |S|t|o|r|y: 386421
- E|d|i|t|o|r|i|a|l| |C|a|r|t|o|o|n|/|C|o|m|i|c: 78993
- Editorial|Commentary: 156342
- C|r|e|d|i|t|/|A|c|k|n|o|w|l|e|d|g|e|m|e|n|t: 68356
- Classified Advertisement|Advertisement: 291533
- R|e|v|i|e|w: 86889
- Table of Contents|Front Matter: 69798
- Letter to the Editor|Correspondence: 202071
- News|Legal Notice: 24053
- News|Marriage Announcement: 41314
- B|i|r|t|h| |N|o|t|i|c|e: 926
- News|Military/War News: 3
- U|n|d|e|f|i|n|e|d: 5
- Article|Feature: 137526
- Front Matter|Table of Contents: 11195
- Commentary|Editorial: 3386
- Marriage Announcement|News: 683
- Correspondence|Letter to the Editor: 7479
- Legal Notice|News: 1029
- Advertisement|Classified Advertisement: 12163


# TODO

- Back to [Table of Contents](#Table-of-Contents)

TODO:

- figure out which ObjectTypes to explore, pick a folder and just eyeball a few, to see what they look like.