<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Setup---Debug" data-toc-modified-id="Setup---Debug-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Setup - Debug</a></span></li><li><span><a href="#Setup---Imports" data-toc-modified-id="Setup---Imports-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Setup - Imports</a></span></li><li><span><a href="#Setup---logging" data-toc-modified-id="Setup---logging-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Setup - logging</a></span></li><li><span><a href="#Setup---virtualenv-jupyter-kernel" data-toc-modified-id="Setup---virtualenv-jupyter-kernel-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Setup - virtualenv jupyter kernel</a></span></li><li><span><a href="#Setup---Initialize-Django" data-toc-modified-id="Setup---Initialize-Django-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Setup - Initialize Django</a></span></li><li><span><a href="#Setup---Initialize-LoggingHelper" data-toc-modified-id="Setup---Initialize-LoggingHelper-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Setup - Initialize LoggingHelper</a></span></li></ul></li><li><span><a href="#Find-articles-to-be-loaded" data-toc-modified-id="Find-articles-to-be-loaded-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Find articles to be loaded</a></span><ul class="toc-item"><li><span><a href="#Loading-setup---working-folder-paths" data-toc-modified-id="Loading-setup---working-folder-paths-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Loading setup - working folder paths</a></span></li><li><span><a href="#Uncompress-files" data-toc-modified-id="Uncompress-files-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Uncompress 
files</a></span></li><li><span><a href="#Work-with-uncompressed-files" data-toc-modified-id="Work-with-uncompressed-files-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Work with uncompressed files</a></span></li><li><span><a href="#load-a-file-into-memory." data-toc-modified-id="load-a-file-into-memory.-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>load a file into memory.</a></span></li></ul></li><li><span><a href="#TODO" data-toc-modified-id="TODO-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>TODO</a></span></li></ul></div>

# Introduction

- Back to [Table of Contents](#Table-of-Contents)

This notebook expands on the OpenCalais code in the file `article_coding.py`, also in this folder.  It adds example sections on selecting the publications you want to submit to OpenCalais, and it is intended to be copied and re-used.

# Setup

- Back to [Table of Contents](#Table-of-Contents)

## Setup - Debug

- Back to [Table of Contents](#Table-of-Contents)

In [1]:
debug_flag = False

## Setup - Imports

- Back to [Table of Contents](#Table-of-Contents)

In [2]:
# python packages
import datetime
import glob
import logging
import lxml
import os
import six
import xml
import xmltodict
import zipfile

## Setup - logging

- Back to [Table of Contents](#Table-of-Contents)

Configure logging for this notebook's kernel.  If you do not run this cell, you'll get the django application's logging configuration.

In [3]:
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = '/home/jonathanmorgan/logs/django-research-data_load-Newsday.log.txt',
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)

## Setup - virtualenv jupyter kernel

- Back to [Table of Contents](#Table-of-Contents)

If you are using a virtualenv, make sure that you:

- have installed your virtualenv as a kernel.
- choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to get it activated somehow inside this notebook.  One option is to run `../dev/wsgi.py` in this notebook, which configures the python environment manually as if you had activated the `sourcenet` virtualenv.  To do this, you'd make a code cell that contains:

    %run ../dev/wsgi.py
    
This is sketchy, however, because it changes your Python environment inside whatever kernel is currently running, and I'd worry about collisions with the actual Python 3 kernel.  A better option is to install the virtualenv as a separate kernel.  Steps:

- activate your virtualenv:

        workon research

- in your virtualenv, install the package `ipykernel`.

        pip install ipykernel

- use the ipykernel python program to install the current environment as a kernel:

        python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
        
    `sourcenet` example:
    
        python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
        
More details: [http://ipython.readthedocs.io/en/stable/install/kernel_install.html](http://ipython.readthedocs.io/en/stable/install/kernel_install.html)
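
To confirm that the notebook is really running on the virtualenv's kernel, it can help to inspect the interpreter from inside a cell. The following is a minimal sketch using only the standard library; the helper name `describe_interpreter` is made up for illustration and is not part of this project.

```python
import sys


def describe_interpreter():
    """Summarize which Python interpreter the current kernel is using."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base_prefix = getattr( sys, "real_prefix", None ) or getattr( sys, "base_prefix", sys.prefix )
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "in_virtualenv": base_prefix != sys.prefix,
    }


for key, value in describe_interpreter().items():
    print( "{}: {}".format( key, value ) )
```

If `executable` does not point inside your virtualenv, the notebook is probably still using a different kernel.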

## Setup - Initialize Django

- Back to [Table of Contents](#Table-of-Contents)

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.

In [4]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [5]:
%run $django_init_path

django initialized at 2019-08-07 02:45:02.344557


In [6]:
# context_text imports
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase


## Setup - Initialize LoggingHelper

- Back to [Table of Contents](#Table-of-Contents)

Create a `LoggingHelper` instance that logs debug messages and can print them at the same time.

Preconditions: Must be run after Django is initialized, since `python_utilities` is in the django path.

In [7]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "proquest_hnp-article-loading-Newsday" )
log_message = None
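
For readers without `python_utilities` installed, the "log and print at once" idea can be approximated with the standard `logging` module. This is only a stand-in sketch: `log_and_print` is a hypothetical helper and does not reproduce how `LoggingHelper.output_debug_message()` is actually implemented.

```python
import logging


def log_and_print( logger_name, message, do_print = True ):
    """Send message to the named logger at DEBUG level, optionally echoing it."""
    logging.getLogger( logger_name ).debug( message )
    if do_print:
        # echo to the notebook output as well as the log.
        print( message )
    return message


log_and_print( "proquest_hnp-article-loading-Newsday", "==> example message" )
```

Where the debug record ends up depends on the logging configuration in effect, for example the `logging.basicConfig()` call earlier in this notebook.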

# Find articles to be loaded

- Back to [Table of Contents](#Table-of-Contents)

Specify which folder of XML files should be loaded into system, then process all files within the folder.

The compressed archives from proquest_hnp just contain publication XML files, no containing folder.

To process:

- **uncompressed paper folder ( `<paper_folder>` )** - make a folder in `/mnt/hgfs/projects/phd/proquest_hnp/uncompressed` for the paper whose data you are working with, named the same as the paper's folder in `/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data`.

    - for example, for the Boston Globe, name it "`BostonGlobe`".

- **uncompressed archive folder ( `<archive_folder>` )** - inside a given paper's folder in uncompressed, for each archive file, create a folder named the same as the archive file, but with no ".zip" at the end.

    - For example, for the file "`BG_20171002210239_00001.zip`", make a folder named "`BG_20171002210239_00001`".
    - path should be "`<paper_folder>/<archive_name_no_zip>`".

- unzip the archive into this folder:

        unzip <path_to_zip> -d <archive_folder>
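
The naming convention above can be sketched in Python as follows. The function name `archive_folder_for` is an assumption made for illustration, and the example zip path simply combines the Boston Globe folder and file names mentioned above.

```python
from pathlib import PurePosixPath


def archive_folder_for( zip_path, uncompressed_root, paper_folder ):
    """Return the folder that an archive should be unzipped into."""
    # stem drops the final ".zip" suffix: "X_00001.zip" -> "X_00001"
    archive_name = PurePosixPath( zip_path ).stem
    return str( PurePosixPath( uncompressed_root ) / paper_folder / archive_name )


print( archive_folder_for(
    "/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/BostonGlobe/BG_20171002210239_00001.zip",
    "/mnt/hgfs/projects/phd/proquest_hnp/uncompressed",
    "BostonGlobe"
) )
```

The returned path is what you would pass as `<archive_folder>` to `unzip -d`.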



## Loading setup - working folder paths

- Back to [Table of Contents](#Table-of-Contents)

What data are we looking at?

In [8]:
# paper identifier
paper_identifier = "Newsday"
archive_identifier = None

# source
source_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data"
source_paper_path = "{}/{}".format( source_paper_folder, paper_identifier )
source_archive_file = None
source_archive_path = None

# uncompressed
uncompressed_paper_folder = "/mnt/hgfs/projects/phd/proquest_hnp/uncompressed"
uncompressed_paper_path = "{}/{}".format( uncompressed_paper_folder, paper_identifier )
uncompressed_archive_path = None

# make sure an identifier is set before you make archive paths here
#     (otherwise you end up with paths like ".../None.zip").
if ( ( archive_identifier is not None ) and ( archive_identifier != "" ) ):
    
    # identifier is set.
    source_archive_file = "{}.zip".format( archive_identifier )
    source_archive_path = "{}/{}".format( source_paper_path, source_archive_file )
    uncompressed_archive_path = "{}/{}".format( uncompressed_paper_path, archive_identifier )

#-- END check to see if archive_identifier present. --#

## Uncompress files

- Back to [Table of Contents](#Table-of-Contents)

See if the uncompressed paper folder exists.  If not, set flag and create it.

In [9]:
# declare variables
did_uncomp_paper_folder_exist = False

# check if uncompressed paper folder exists.
if not os.path.exists( uncompressed_paper_path ):
    
    # no.  Make it.
    os.makedirs( uncompressed_paper_path )
    did_uncomp_paper_folder_exist = False
    log_message = "CREATED - Uncompressed paper folder {}".format( uncompressed_paper_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    
else:
    
    # yes.  Set flag.
    did_uncomp_paper_folder_exist = True
    log_message = "EXISTS - Uncompressed paper folder {}".format( uncompressed_paper_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    
#-- END check to see if paper folder exists.

CREATED - Uncompressed paper folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday


For each *.zip file in the paper's source folder:

- parse file name from path returned by glob.
- parse the part before ".zip" from the file name.  This is referred to subsequently as the "archive identifier".
- check if folder named the same as the "archive identifier" is present.

    - If no:
    
        - create it.
        - then, uncompress the archive into it.
        
    - If yes:
    
        - output a message.  Don't want to uncompress if it was already uncompressed once.

In [None]:
# declare variables - papers
did_uncomp_paper_folder_exist = None

# declare variables archive (.zip) files.
zip_file_list = None
zip_file_path = None
zip_file_path_parts_list = None
zip_file_name = None
zip_file_name_parts_list = None
archive_identifier = None
uc_archive_folder_path = None
zip_file = None

# declare variables - auditing (uc = uncompressed)
archive_file_counter = None
did_uc_archive_folder_exist = None
uc_folder_exists_counter = None
start_dt = None
end_dt = None
time_delta = None

# use glob to get list of zip files in paper source folder.
zip_file_list = glob.glob( "{}/*.zip".format( source_paper_path ) )
zip_file_count = len( zip_file_list )

log_message = "==> zip file count: {}".format( zip_file_count )
my_logging_helper.output_debug_message( log_message, do_print_IN = True )

# loop over zip files.
archive_file_counter = 0
did_uc_archive_folder_exist = False
uc_folder_exists_counter = 0
for zip_file_path in zip_file_list:
    
    # increment counter
    archive_file_counter += 1
    
    log_message = "----> processing file {} of {}".format( archive_file_counter, zip_file_count )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    
    # get file name
    
    # split path into parts on path separator.
    zip_file_path_parts_list = zip_file_path.split( "/" )

    # file name is the last thing in the list.
    zip_file_name = zip_file_path_parts_list[ -1 ]

    # archive_identifier is name with ".zip" removed from end.
    zip_file_name_parts_list = zip_file_name.split( ".zip" )
    archive_identifier = zip_file_name_parts_list[ 0 ]
    
    # for now, log and print the things we've just created.
    log_message = "==> path: {}".format( zip_file_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    log_message = "==> file: {}".format( zip_file_name )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )
    log_message = "==> ID: {}".format( archive_identifier )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    # check if uncompressed archive folder exists.
    uc_archive_folder_path = "{}/{}".format( uncompressed_paper_path, archive_identifier )

    log_message = "==> TO: {}".format( uc_archive_folder_path )
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    # check if the uncompressed archive folder exists.
    did_uc_archive_folder_exist = os.path.exists( uc_archive_folder_path )
    if not did_uc_archive_folder_exist:

        # no.  Make it.
        os.makedirs( uc_archive_folder_path )
        log_message = "CREATED - Uncompressed archive folder {}".format( uc_archive_folder_path )
        my_logging_helper.output_debug_message( log_message, do_print_IN = True )
        
        # and uncompress archive to it.
        with zipfile.ZipFile( zip_file_path, 'r' ) as zip_file:

            # starting extract.
            start_dt = datetime.datetime.now()
            log_message = "==> extract started at {}".format( start_dt )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
            # unzip to uncompressed archive folder path.
            zip_file.extractall( uc_archive_folder_path )
            
            log_message = "EXTRACTED - {}".format( zip_file_path )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )

            log_message = "TO uncompressed archive folder - {}".format( uc_archive_folder_path )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
            # complete
            end_dt = datetime.datetime.now()
            
            log_message = "==> extract completed at {}".format( end_dt )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
            log_message = "==> time elapsed: {}".format( end_dt - start_dt )
            my_logging_helper.output_debug_message( log_message, do_print_IN = True )
            
        #-- END with ZipFile --#

    else:

        # yes.  Increment the counter of folders that already existed.
        uc_folder_exists_counter += 1
        log_message = "EXISTS, so moving on - Uncompressed archive folder {}".format( uc_archive_folder_path )
        my_logging_helper.output_debug_message( log_message, do_print_IN = True )

    #-- END check to see if archive folder exists. --#

    log_message = "------------------------------"
    my_logging_helper.output_debug_message( log_message, do_print_IN = True )

#-- END loop over zip files. --#

==> zip file count: 194
----> processing file 1 of 194
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/Newsday/Newsday_20171006220113_00001.zip
==> file: Newsday_20171006220113_00001.zip
==> ID: Newsday_20171006220113_00001
==> TO: /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006220113_00001
CREATED - Uncompressed archive folder /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006220113_00001
==> extract started at 2019-08-07 02:52:59.761334
EXTRACTED - /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/Newsday/Newsday_20171006220113_00001.zip
TO uncompressed archive folder - /mnt/hgfs/projects/phd/proquest_hnp/uncompressed/Newsday/Newsday_20171006220113_00001
==> extract completed at 2019-08-07 02:53:58.689399
==> time elapsed: 0:00:58.928065
------------------------------
----> processing file 2 of 194
==> path: /mnt/hgfs/projects/phd/proquest_hnp/proquest_hnp/data/Newsday/Newsday_20171006220215_00002.zip
==> file: News

## Work with uncompressed files

- Back to [Table of Contents](#Table-of-Contents)

Change the working directory to the uncompressed paper path.

In [None]:
%cd $uncompressed_paper_path

In [None]:
%ls

## load a file into memory.

- Back to [Table of Contents](#Table-of-Contents)

Load one of the files into memory and see what we can do with it.  Beautiful Soup?

Looks like the root element is "Record", then the high-level type of the article is "ObjectType".

ObjectType values:

- Advertisement
- ...

Good options for XML parser:

- `lxml.etree` - [https://stackoverflow.com/questions/12290091/reading-xml-file-and-fetching-its-attributes-value-in-python](https://stackoverflow.com/questions/12290091/reading-xml-file-and-fetching-its-attributes-value-in-python)
- `xmltodict` - [https://docs.python-guide.org/scenarios/xml/](https://docs.python-guide.org/scenarios/xml/)
- `beautifulsoup` using `lxml`
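
As a rough illustration of the first option, here is a sketch that reads the `ObjectType` values from a `Record`. It uses the standard library's `xml.etree.ElementTree` as a stand-in, since `lxml.etree` offers a largely compatible API. The sample XML and the helper name `object_type_of` are assumptions for illustration, not real proquest_hnp data.

```python
import xml.etree.ElementTree as ET

# hypothetical sample, shaped like the structure described above.
SAMPLE_XML = """<Record>
    <ObjectType>Feature</ObjectType>
    <ObjectType>Article</ObjectType>
</Record>"""


def object_type_of( xml_text ):
    """Return the Record's ObjectType values joined with "|", or None."""
    root = ET.fromstring( xml_text )
    if root.tag != "Record":
        # not the root element we expect.
        return None
    values = [
        node.text.strip()
        for node in root.findall( "ObjectType" )
        if node.text and node.text.strip()
    ]
    return "|".join( values ) if values else None


print( object_type_of( SAMPLE_XML ) )
```

Joining with "|" keeps records that carry several `ObjectType` elements distinct from records that carry only one of those values.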

In [None]:
# loop over files in the current archive folder path.

# declare variables
xml_file_list = None
xml_file_path = None
xml_file = None
xml_dict = None
xml_file_counter = None
object_type_to_count_map = None
object_type_count = None
record_node = None
object_type_node = None
object_type_list = None
object_type = None

# declare variables - auditing
xml_file_counter = None
no_record_counter = None
no_object_type_counter = None
no_object_type_text_counter = None

# init
object_type_to_count_map = {}

# get file list.
xml_file_list = glob.glob( "{}/*.xml".format( uc_archive_folder_path ) )

# loop
xml_file_counter = 0
no_record_counter = 0
no_object_type_counter = 0
no_object_type_text_counter = 0
for xml_file_path in xml_file_list:
    
    xml_file_counter += 1
    
    # try to parse the file
    with open( xml_file_path ) as xml_file:
    
        # parse XML
        xml_dict = xmltodict.parse( xml_file.read() )
        
        # get root.Record.ObjectType value
        record_node = xml_dict.get( "Record", None )
        
        if ( record_node is not None ):
            
            # get object type.  xmltodict stores an element that has
            #     no attributes and no child elements as just its
            #     string contents, mapped to the element name in the
            #     parent (no nested dictionary, so no "#text" key).
            #     If the element repeats, the name maps to a list
            #     of those strings instead.
            #
            # so for:
            # <Record>
            #     ...
            #     <ObjectType>Advertisement</ObjectType>
            #     ...
            # </Record>
            #
            # record_node.get( "ObjectType", None ) returns the string
            #     "Advertisement", so this does NOT work:
            #     object_type_node = record_node.get( "ObjectType", None )
            #     object_type = object_type_node.get( "#text", None )
            #
            # Doc that led me astray: https://docs.python-guide.org/scenarios/xml/
            object_type_list = record_node.get( "ObjectType", None )
            if ( object_type_list is None ):

                # no ObjectType element (or an empty one) in this Record.
                object_type = None
                no_object_type_counter += 1

            elif ( isinstance( object_type_list, list ) ):

                # repeated ObjectType elements - join the string values,
                #     skipping empty or non-string entries.
                object_type = "|".join( item for item in object_type_list if isinstance( item, str ) )

            else:

                # single ObjectType element - already a string.
                object_type = object_type_list

            #-- END check what kind of ObjectType value we got. --#

            # got a type?
            if ( ( object_type is not None ) and ( object_type != "" ) ):

                # we do.  Increment count.
                object_type_count = object_type_to_count_map.get( object_type, 0 )
                object_type_count += 1
                object_type_to_count_map[ object_type ] = object_type_count

            elif ( object_type_list is not None ):

                # ObjectType is present, but has no usable value.
                no_object_type_text_counter += 1

            #-- END check for type value --#
            
        else:
            
            # increment counter
            no_record_counter += 1
            
        #-- END check if we found a "Record" node in root --#

    #-- END with open( xml_file_path )...: --#
    
#-- END loop over XML files --#

print( "XML file count: {}".format( len( xml_file_list ) ) )
print( "Counters:" )
print( "- Processed {} files".format( xml_file_counter ) )
print( "- No Record: {}".format( no_record_counter ) )
print( "- No ObjectType: {}".format( no_object_type_counter ) )
print( "- No ObjectType value: {}".format( no_object_type_text_counter ) )
print( "\nObjectType values and occurrence counts:")
for object_type, object_type_count in six.iteritems( object_type_to_count_map ):
    
    # print type and count
    print( "- {}: {}".format( object_type, object_type_count ) )
    
#-- END loop over object types. --#
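
The tallying step can also be written more compactly with `collections.Counter`. This is a sketch under the assumption that each value arrives in one of the shapes `xmltodict` produces for simple elements: `None`, a single string, or a list of strings. The helper names and sample values are made up for illustration.

```python
from collections import Counter


def normalize_object_type( value ):
    """Collapse an xmltodict-style ObjectType value to one string, or None."""
    if value is None:
        return None
    if isinstance( value, str ):
        return value or None
    # otherwise assume a list of strings (repeated elements).
    joined = "|".join( item for item in value if item )
    return joined or None


def count_object_types( values ):
    """Count normalized ObjectType values, skipping missing ones."""
    normalized = ( normalize_object_type( value ) for value in values )
    return Counter( item for item in normalized if item is not None )


# hypothetical sample values.
counts = count_object_types( [ "Advertisement", [ "Feature", "Article" ], None, "Advertisement" ] )
for object_type, object_type_count in counts.most_common():
    print( "- {}: {}".format( object_type, object_type_count ) )
```

`most_common()` also sorts the types by frequency, which makes the summary easier to scan than plain dictionary order.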

# TODO

- Back to [Table of Contents](#Table-of-Contents)

TODO:

- create bash script to uncompress all files in a folder from data to uncompressed, when passed paper and archive identifiers?
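
Until that script exists, the same idea can be sketched as a Python function that takes the paper and archive identifiers. This is a rough stand-in, not the planned bash script; the function name `uncompress_archive` and its parameters are assumptions, and the folder layout follows the convention described earlier in this notebook.

```python
import os
import zipfile


def uncompress_archive( data_root, uncompressed_root, paper_identifier, archive_identifier ):
    """Unzip <data_root>/<paper>/<archive>.zip into <uncompressed_root>/<paper>/<archive>.

    Returns True if the archive was extracted, False if the target folder
    already existed and extraction was skipped.
    """
    zip_path = os.path.join( data_root, paper_identifier, "{}.zip".format( archive_identifier ) )
    target_folder = os.path.join( uncompressed_root, paper_identifier, archive_identifier )

    if os.path.exists( target_folder ):
        # already uncompressed once, so leave it alone.
        return False

    # open the archive first, so a missing zip file does not leave
    #     an empty target folder behind.
    with zipfile.ZipFile( zip_path, "r" ) as zip_file:
        os.makedirs( target_folder )
        zip_file.extractall( target_folder )

    return True
```

Calling it in a loop over the archive identifiers for a paper would mirror the extraction cell above.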