<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Setup---Debug" data-toc-modified-id="Setup---Debug-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Setup - Debug</a></span></li><li><span><a href="#Setup---Imports" data-toc-modified-id="Setup---Imports-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Setup - Imports</a></span></li><li><span><a href="#Setup---virtualenv-jupyter-kernel" data-toc-modified-id="Setup---virtualenv-jupyter-kernel-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Setup - virtualenv jupyter kernel</a></span></li><li><span><a href="#Setup---Initialize-Django" data-toc-modified-id="Setup---Initialize-Django-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Setup - Initialize Django</a></span></li></ul></li><li><span><a href="#Tag-articles-to-be-coded" data-toc-modified-id="Tag-articles-to-be-coded-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Tag articles to be coded</a></span><ul class="toc-item"><li><span><a href="#which-articles-need-to-be-tagged?" 
data-toc-modified-id="which-articles-need-to-be-tagged?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>which articles need to be tagged?</a></span></li><li><span><a href="#tag-all-local-news" data-toc-modified-id="tag-all-local-news-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>tag all local news</a></span><ul class="toc-item"><li><span><a href="#Grand-Rapids-Press-local-news" data-toc-modified-id="Grand-Rapids-Press-local-news-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Grand Rapids Press local news</a></span></li><li><span><a href="#Detroit-News-local-news" data-toc-modified-id="Detroit-News-local-news-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Detroit News local news</a></span></li></ul></li></ul></li><li><span><a href="#Code-Articles" data-toc-modified-id="Code-Articles-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Code Articles</a></span></li></ul></div>

# Setup

- Back to [Table of Contents](#Table-of-Contents)

## Setup - Debug

- Back to [Table of Contents](#Table-of-Contents)

In [1]:
debug_flag = True

## Setup - Imports

- Back to [Table of Contents](#Table-of-Contents)

In [17]:
import datetime
from django.db.models import Avg, Max, Min
import six

## Setup - virtualenv jupyter kernel

- Back to [Table of Contents](#Table-of-Contents)

If you are using a virtualenv, make sure that you:

- have installed your virtualenv as a kernel.
- choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to activate it somehow inside this notebook.  One option is to run `../dev/wsgi.py` in this notebook, which configures the Python environment manually as if you had activated the `sourcenet` virtualenv.  To do this, make a code cell that contains:

    %run ../dev/wsgi.py
    
This is sketchy, however, because it changes your Python environment inside whatever kernel is currently running, and I'd worry about collisions with the actual Python 3 kernel.  A better option is to install the virtualenv as a separate kernel.  Steps:

- activate your virtualenv:

        workon research

- in your virtualenv, install the package `ipykernel`.

        pip install ipykernel

- use the ipykernel python program to install the current environment as a kernel:

        python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
        
    `sourcenet` example:
    
        python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
        
More details: [http://ipython.readthedocs.io/en/stable/install/kernel_install.html](http://ipython.readthedocs.io/en/stable/install/kernel_install.html)
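
Once the kernel is installed and selected, a quick check from inside the notebook can confirm that it is really running the virtualenv's interpreter.  The sketch below uses only the standard library; `describe_interpreter` is a hypothetical helper name, not part of the project code.

```python
import sys


def describe_interpreter():
    """Return basic facts about the interpreter running this kernel."""
    # older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    in_virtualenv = (
        hasattr( sys, "real_prefix" )
        or sys.prefix != getattr( sys, "base_prefix", sys.prefix )
    )
    return {
        # path to the python binary the kernel launched
        "executable": sys.executable,
        # root folder of the active environment
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv,
    }


for name, value in describe_interpreter().items():
    print( "{}: {}".format( name, value ) )
```

If `executable` does not point inside the virtualenv folder, the notebook is probably still using a different kernel.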

## Setup - Initialize Django

- Back to [Table of Contents](#Table-of-Contents)

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.

In [5]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [6]:
%run $django_init_path

django initialized at 2019-07-01 17:31:48.768211


In [15]:
# context_text imports
from context_text.models import Article
from context_text.article_coding.article_coding import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder

# Tag articles to be coded

- Back to [Table of Contents](#Table-of-Contents)

Tag all locally authored hard news articles in the database, then work through using OpenCalais to code them all, starting with those proximal to the coding sample for the methods paper.

## which articles need to be tagged?

- Back to [Table of Contents](#Table-of-Contents)

More precisely, find all articles that have Article_Data coded by the automated coder with type "OpenCalais_REST_API_v2" and tag the articles as "coded-open_calais_v2" or something like that.

Then, for articles without that tag, use our criteria for local hard news to filter and tag articles published in the year before and the year after the month used to evaluate the automated coder, in both the Grand Rapids Press and the Detroit News, so I can look at longer time frames.  After that, code all articles currently in the database.

Eventually, we'll code and compare articles from before and after layoffs.
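
As a rough sketch of the date window described above, the helper below computes a start and end date around a given month.  It assumes one reading of "the year before and after": from the first day of the same month one year earlier through the last day of the same month one year later.  The function name `window_around_month` and the month passed to it are hypothetical, chosen only for illustration.

```python
import calendar
import datetime


def window_around_month( year, month, years_before = 1, years_after = 1 ):
    """Return ( start_date, end_date ) covering whole months around a month."""
    # first day of the same month, years_before years earlier
    start_date = datetime.date( year - years_before, month, 1 )

    # last day of the same month, years_after years later
    end_year = year + years_after
    last_day = calendar.monthrange( end_year, month )[ 1 ]
    end_date = datetime.date( end_year, month, last_day )

    return ( start_date, end_date )


# hypothetical evaluation month (an assumption, not taken from the data)
window_start, window_end = window_around_month( 2009, 12 )
print( "{} through {}".format( window_start, window_end ) )
```

The resulting dates could then feed a `pub_date` range filter on the article query set.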

In [11]:
# look for publications that have article data:
# - coded by automated coder
# - with coder type of "OpenCalais_REST_API_v2"

# get automated coder
automated_coder_user = ArticleCoder.get_automated_coding_user()

print( "{} - Loaded automated user: {}, id = {}".format( datetime.datetime.now(), automated_coder_user, automated_coder_user.id ) )

2019-07-01 17:49:56.250169 - Loaded automated user: automated, id = 2


In [35]:
# try aggregates
article_qs = Article.objects.all()
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )

{'pub_date__max': datetime.date(2010, 11, 30), 'pub_date__min': datetime.date(2005, 1, 1)}


In [16]:
# find articles with Article_Data created by the automated user.
article_qs = Article.objects.filter( article_data__coder = automated_coder_user )
article_qs = article_qs.filter( article_data__coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION )
article_count = article_qs.count()
print( "Found {} articles".format( article_count ) )

Found 892 articles


In [28]:
# profile these publications
min_pub_date = None
max_pub_date = None
current_pub_date = None
pub_date_count = None
date_to_count_map = {}
date_to_articles_map = {}
pub_date_article_dict = None

# try aggregates
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )

# counts of pubs by date
for current_article in article_qs:
    
    # get pub_date
    current_pub_date = current_article.pub_date
    current_article_id = current_article.id
    
    # get count, increment, and store.
    pub_date_count = date_to_count_map.get( current_pub_date, 0 )
    pub_date_count += 1
    date_to_count_map[ current_pub_date ] = pub_date_count
    
    # also, store up ids and instances
    
    # get dict of article ids to article instances for date
    pub_date_article_dict = date_to_articles_map.get( current_pub_date, {} )
    
    # article already there?
    if ( current_article_id not in pub_date_article_dict ):
        
        # no - add it.
        pub_date_article_dict[ current_article_id ] = current_article
        
    #-- END check to see if article already there.
    
    # put dict back.
    date_to_articles_map[ current_pub_date ] = pub_date_article_dict
    
#-- END loop over articles. --#

# output dates and counts.

# get list of keys from map
keys_list = list( six.viewkeys( date_to_count_map ) )
keys_list.sort()
for current_pub_date in keys_list:
    
    # get count
    pub_date_count = date_to_count_map.get( current_pub_date, 0 )
    print( "- {} ( {} ) count: {}".format( current_pub_date, type( current_pub_date ), pub_date_count ) )
    
#-- END loop over dates --#

{'pub_date__max': datetime.date(2010, 7, 31), 'pub_date__min': datetime.date(2005, 1, 7)}
- 2005-01-07 ( <class 'datetime.date'> ) count: 1
- 2005-01-21 ( <class 'datetime.date'> ) count: 1
- 2005-02-07 ( <class 'datetime.date'> ) count: 2
- 2005-02-10 ( <class 'datetime.date'> ) count: 1
- 2005-05-10 ( <class 'datetime.date'> ) count: 1
- 2005-05-16 ( <class 'datetime.date'> ) count: 1
- 2005-06-06 ( <class 'datetime.date'> ) count: 1
- 2005-06-22 ( <class 'datetime.date'> ) count: 1
- 2005-07-02 ( <class 'datetime.date'> ) count: 1
- 2005-07-05 ( <class 'datetime.date'> ) count: 5
- 2005-07-07 ( <class 'datetime.date'> ) count: 1
- 2005-08-22 ( <class 'datetime.date'> ) count: 1
- 2005-09-04 ( <class 'datetime.date'> ) count: 1
- 2005-09-11 ( <class 'datetime.date'> ) count: 1
- 2005-09-25 ( <class 'datetime.date'> ) count: 1
- 2005-10-04 ( <class 'datetime.date'> ) count: 1
- 2005-10-20 ( <class 'datetime.date'> ) count: 1
- 2005-10-21 ( <class 'datetime.date'> ) count: 1
- 2005-10-
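
The counting loop above can also be written more compactly with `collections.Counter` and `collections.defaultdict`.  This is a sketch over invented `( article id, pub_date )` pairs rather than real `Article` instances, so it runs without Django.

```python
import collections
import datetime

# stand-in records, invented for illustration: ( article id, pub_date )
sample_articles = [
    ( 101, datetime.date( 2005, 1, 7 ) ),
    ( 102, datetime.date( 2005, 2, 7 ) ),
    ( 103, datetime.date( 2005, 2, 7 ) ),
]

# count of articles per date, and list of article IDs per date
date_to_count = collections.Counter()
date_to_ids = collections.defaultdict( list )
for article_id, pub_date in sample_articles:
    date_to_count[ pub_date ] += 1
    date_to_ids[ pub_date ].append( article_id )

# output dates and counts in date order
for pub_date in sorted( date_to_count ):
    print( "- {} count: {}".format( pub_date, date_to_count[ pub_date ] ) )
```

Both containers supply a default for missing keys, which removes the get-then-store pattern in the original loop.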

In [34]:
# look at the 2010-07-31 date
pub_date = datetime.datetime.strptime( "2010-07-31", "%Y-%m-%d" ).date()
articles_for_date = date_to_articles_map.get( pub_date, {} )
print( articles_for_date )

# get the article and look at its tags.
article_instance = articles_for_date.get( 6065 )
print( article_instance.tags.all() )

for article_data in article_instance.article_data_set.all():
    
    print( article_data )

{6065: <Article: 6065 - Jul 31, 2010, City and Region ( A6 ), UID: 1315C0760F2D0668 - Local ArtPeers registration opens ( Grand Rapids Press, The )>}
<QuerySet [<Tag: prelim_reliability_test>, <Tag: prelim_reliability_combined>]>
2180 - minnesota1 - no coder_type -- Article: 6065 - Jul 31, 2010, City and Region ( A6 ), UID: 1315C0760F2D0668 - Local ArtPeers registration opens ( Grand Rapids Press, The )
2200 - minnesota2 - no coder_type -- Article: 6065 - Jul 31, 2010, City and Region ( A6 ), UID: 1315C0760F2D0668 - Local ArtPeers registration opens ( Grand Rapids Press, The )
2281 - minnesota3 - no coder_type -- Article: 6065 - Jul 31, 2010, City and Region ( A6 ), UID: 1315C0760F2D0668 - Local ArtPeers registration opens ( Grand Rapids Press, The )
2969 - automated ( ADCT: OpenCalais_REST_API_v2 )  -- Article: 6065 - Jul 31, 2010, City and Region ( A6 ), UID: 1315C0760F2D0668 - Local ArtPeers registration opens ( Grand Rapids Press, The )


## tag all local news

- Back to [Table of Contents](#Table-of-Contents)

Definitions of local hard news by in-house authors for the Grand Rapids Press and the Detroit News follow.  For each paper, tag all matching articles in the database as "local_hard_news".

### Grand Rapids Press local news

- Back to [Table of Contents](#Table-of-Contents)

Grand Rapids Press local hard news:

- `context_text/examples/articles/articles-GRP-local_news.py`
- local hard news sections (stored in `Article.GRP_NEWS_SECTION_NAME_LIST`):

    - "Business"
    - "City and Region"
    - "Front Page"
    - "Lakeshore"
    - "Religion"
    - "Special"
    - "State"

- in-house author (based on byline patterns, stored in `sourcenet.models.Article.Q_GRP_IN_HOUSE_AUTHOR`):

    - Byline ends in "/ THE GRAND RAPIDS PRESS", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *THE GRAND RAPIDS PRESS$' )`

    - Byline ends in "/ PRESS * EDITOR", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *PRESS .* EDITOR$' )`

    - Byline ends in "/ GRAND RAPIDS PRESS * BUREAU", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *GRAND RAPIDS PRESS .* BUREAU$' )`

    - Byline ends in "/ SPECIAL TO THE PRESS", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *SPECIAL TO THE PRESS$' )`
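
To sanity-check the byline patterns above before running them against the database, they can be tried on a few sample strings.  This is only an approximation: Django's `iregex` lookup is evaluated by the database's regex engine, while the sketch uses Python's `re` module with `re.IGNORECASE`.  The helper name `is_grp_in_house` and the sample bylines are invented for illustration.

```python
import re

# patterns copied from the list above
GRP_IN_HOUSE_PATTERNS = [
    r'.* */ *THE GRAND RAPIDS PRESS$',
    r'.* */ *PRESS .* EDITOR$',
    r'.* */ *GRAND RAPIDS PRESS .* BUREAU$',
    r'.* */ *SPECIAL TO THE PRESS$',
]

# compile once, ignoring case to mimic iregex
GRP_IN_HOUSE_REGEXES = [
    re.compile( pattern, re.IGNORECASE ) for pattern in GRP_IN_HOUSE_PATTERNS
]


def is_grp_in_house( byline ):
    """Return True if the byline matches any in-house pattern."""
    return any( regex.search( byline ) for regex in GRP_IN_HOUSE_REGEXES )


# invented bylines, for illustration only
sample_bylines = [
    "By Jane Doe / The Grand Rapids Press",
    "By John Roe / Press Business Editor",
    "The Associated Press",
]
for byline in sample_bylines:
    print( "{!r}: {}".format( byline, is_grp_in_house( byline ) ) )
```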

### Detroit News local news

- Back to [Table of Contents](#Table-of-Contents)

Detroit News local news:

- `context_text/examples/articles/articles-TDN-local_news.py`
- local hard news sections (stored in `from context_text.collectors.newsbank.newspapers.DTNB import DTNB` - `DTNB.NEWS_SECTION_NAME_LIST`):

    - "Business"
    - "Metro"
    - "Nation" - because of auto industry stories

- in-house author (based on byline patterns, stored in `DTNB.Q_IN_HOUSE_AUTHOR`):

    - Byline ends in "/ The Detroit News", ignore case.

        - `Q( author_varchar__iregex = r'.*\s*/\s*the\s*detroit\s*news$' )`

    - Byline ends in "Special to The Detroit News", ignore case.

        - `Q( author_varchar__iregex = r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$' )`

    - Byline ends in "Detroit News * Bureau", ignore case.

        - `Q( author_varchar__iregex = r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$' )`   
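
The two criteria above can be combined into a single check.  The sketch below assumes that an article counts as local hard news when its section is in the section list AND its byline matches at least one in-house pattern; that combination logic is an assumption, not taken from `DTNB` itself.  It runs on plain strings with Python's `re` module rather than the database regex engine, and `is_detroit_local_hard_news` is a hypothetical name.

```python
import re

# section names and byline patterns copied from the lists above
DTNB_SECTIONS = [ "Business", "Metro", "Nation" ]
DTNB_IN_HOUSE_PATTERNS = [
    r'.*\s*/\s*the\s*detroit\s*news$',
    r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$',
    r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$',
]


def is_detroit_local_hard_news( section, byline ):
    """Assumed rule: section in list AND byline matches any pattern."""
    in_section = section in DTNB_SECTIONS
    in_house = any(
        re.search( pattern, byline, re.IGNORECASE )
        for pattern in DTNB_IN_HOUSE_PATTERNS
    )
    return in_section and in_house


# invented ( section, byline ) pairs, for illustration only
samples = [
    ( "Metro", "By Jane Doe / The Detroit News" ),
    ( "Sports", "By Jane Doe / The Detroit News" ),
    ( "Nation", "Associated Press" ),
]
for section, byline in samples:
    print( "{} | {!r}: {}".format( section, byline, is_detroit_local_hard_news( section, byline ) ) )
```

Note that all three patterns as written require a "/" before the affiliation, so a byline without a slash will not match.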

# Code Articles

- Back to [Table of Contents](#Table-of-Contents)

In [None]:
# declare variables

# declare variables - article filter parameters
start_pub_date = None # should be datetime instance
end_pub_date = None # should be datetime instance
tag_in_list = []
paper_id_in_list = []
section_list = []
article_id_in_list = []
params = {}

# declare variables - processing
do_i_print_updates = True
my_article_coding = None
article_qs = None
article_count = -1
coding_status = ""
limit_to = -1
do_coding = False

# declare variables - results
success_count = -1
success_list = None
got_errors = False
error_count = -1
error_dictionary = None
error_article_id = -1
error_status_list = None
error_status = ""
error_status_counter = -1

# first, get a list of articles to code.

# ! Set param values.

# ==> start and end dates
#start_pub_date = "2009-12-06"
#end_pub_date = "2009-12-12"

# ==> tagged articles
#tag_in_list = "prelim_reliability"
#tag_in_list = "prelim_network"
#tag_in_list = "prelim_unit_test_007"
#tag_in_list = [ "prelim_reliability", "prelim_network" ]
#tag_in_list = [ "prelim_reliability_test" ] # 60 articles - Grand Rapids only.
#tag_in_list = [ "prelim_reliability_combined" ] # 87 articles, Grand Rapids and Detroit.
#tag_in_list = [ "prelim_training_001" ]
tag_in_list = [ "grp_month" ]

# ==> IDs of newspapers to include.
#paper_id_in_list = "1"

# ==> names of sections to include.
#section_list = "Lakeshore,Front Page,City and Region,Business"

# ==> just limit to specific articles by ID.
#article_id_in_list = [ 360962 ]
#article_id_in_list = [ 28598 ]
#article_id_in_list = [ 21653, 21756 ]
#article_id_in_list = [ 90948 ]
#article_id_in_list = [ 21627, 21609, 21579 ]
#article_id_in_list = [ 48778 ]
#article_id_in_list = [ 6065 ]
#article_id_in_list = [ 221858 ]
#article_id_in_list = [ 23804, 22630 ]
article_id_in_list = [ 23804 ]

# filter parameters
params[ ArticleCoding.PARAM_START_DATE ] = start_pub_date
params[ ArticleCoding.PARAM_END_DATE ] = end_pub_date
params[ ArticleCoding.PARAM_TAG_LIST ] = tag_in_list
params[ ArticleCoding.PARAM_PUBLICATION_LIST ] = paper_id_in_list
params[ ArticleCoding.PARAM_SECTION_LIST ] = section_list
params[ ArticleCoding.PARAM_ARTICLE_ID_LIST ] = article_id_in_list

# set coder you want to use.

# OpenCalais REST API v.2
params[ ArticleCoding.PARAM_CODER_TYPE ] = ArticleCoding.ARTICLE_CODING_IMPL_OPEN_CALAIS_API_V2

# get instance of ArticleCoding
my_article_coding = ArticleCoding()
my_article_coding.do_print_updates = do_i_print_updates

# set params
my_article_coding.store_parameters( params )

# create query set - ArticleCoding does the filtering for you.
article_qs = my_article_coding.create_article_query_set()

# limit for an initial test?
if ( ( limit_to is not None ) and ( isinstance( limit_to, int ) == True ) and ( limit_to > 0 ) ):

    # yes.
    article_qs = article_qs[ : limit_to ]

#-- END check to see if limit --#

# get article count
article_count = article_qs.count()

print( "Query params:" )
print( params )
print( "Matching article count: " + str( article_count ) )

# Do coding?
if ( do_coding == True ):

    print( "do_coding == True - it's on!" )

    # yes - make sure we have at least one article:
    if ( article_count > 0 ):

        # invoke the code_article_data( self, query_set_IN ) method.
        coding_status = my_article_coding.code_article_data( article_qs )
    
        # output status
        print( "\n\n==============================\n\nCoding status: \"" + coding_status + "\"" )
        
        # get success count
        success_count = my_article_coding.get_success_count()
        print( "\n\n====> Count of articles successfully processed: " + str( success_count ) )    
        
        # if successes, list out IDs.
        if ( success_count > 0 ):
        
            # there were successes.
            success_list = my_article_coding.get_success_list()
            print( "- list of successfully processed articles: " + str( success_list ) )
        
        #-- END check to see if successes. --#
        
        # got errors?
        got_errors = my_article_coding.has_errors()
        if ( got_errors == True ):
        
            # get error dictionary
            error_dictionary = my_article_coding.get_error_dictionary()
            
            # get error count
            error_count = len( error_dictionary )
            print( "\n\n====> Count of articles with errors: " + str( error_count ) )
            
            # loop...
            for error_article_id, error_status_list in six.iteritems( error_dictionary ):
            
                # output errors for this article.
                print( "- errors for article ID " + str( error_article_id ) + ":" )
                
                # loop over status messages.
                error_status_counter = 0
                for error_status in error_status_list:
                
                    # increment status
                    error_status_counter += 1

                    # print status
                    print( "----> status #" + str( error_status_counter ) + ": " + error_status )
                    
                #-- END loop over status messages. --#
            
            #-- END loop over articles. --#
   
        #-- END check to see if errors --#
    
    #-- END check to see if article count. --#
    
else:
    
    # output matching article count.
    print( "do_coding == False, so dry run" )
    
#-- END check to see if we do_coding --#
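
The error-reporting loop in the cell above could be factored into a small helper so it can be reused and tested on its own.  This is a sketch, assuming the error dictionary maps article IDs to lists of status strings, as the loop above implies; `format_error_report` is a hypothetical name and the sample data is invented.

```python
def format_error_report( error_dictionary ):
    """Build report lines from a dict of article ID -> list of status strings."""
    lines = []
    lines.append( "====> Count of articles with errors: {}".format( len( error_dictionary ) ) )

    # loop over articles in ID order so output is stable
    for article_id, status_list in sorted( error_dictionary.items() ):
        lines.append( "- errors for article ID {}:".format( article_id ) )

        # number status messages starting at 1
        for counter, status in enumerate( status_list, start = 1 ):
            lines.append( "----> status #{}: {}".format( counter, status ) )

    return lines


# invented error data, for illustration only
sample_errors = { 23804 : [ "timeout", "empty response" ] }
print( "\n".join( format_error_report( sample_errors ) ) )
```

Returning a list of lines rather than printing directly keeps the helper usable for logging as well as notebook output.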
