**_test_data-update_data-network_output.ipynb_ - Update unit test data to include network_output test info**

- derived from: [newsbank-article_coding-unittest.ipynb](../article_coding/newsbank-article_coding-unittest.ipynb)

# work

- update test data:

    - // load existing fixtures.
    - // includes 43 coded with OpenCalais.
    - // make sure we can generate network data from them.
    - single-name data
    
        - there are two single-name sources.

    - things to include in actual data:
        - build out network data for a few different specs (one with single name, one without).

            - specs are in [analysis-network_data_output_example.ipynb](./analysis/analysis-network_data_output_example.ipynb)

        - for each data spec, in NetworkDataOutputLog, capture output from no-single-names for original data and data where set records have single-names introduced, both with details on and details off.
        - also get the hashes and length of output strings you'd expect and store the values in the test case.
        - tag a few Article_Subjects with a random tag ("name_error"...?), for testing removal of Article_Subjects that have been assigned a given tag.

            - Tag `from_press_release` added to the following `Article_Subject` instances:

                - 740 - granholm (person 102)
                - 637 - Mark Meadows (person 224)
                - 677 - Gary Nelund (person 261)

            - Tag `godwin_heights` added to the following `Article_Subject` instances:

                - 623 - Felske, Jon (person 188)
                - 622 - Johnston, Allen E. (person 187)
                - 621 - Hornecker, Kenneth (person 189)

    - remove superuser user from auth.
    - re-export the "export" fixture that includes the network data output log and the tags (should just need those two).

- make sure existing unit tests work with new data.
- new unit tests:

    - unit test code is in [analysis-network_data_output_example.ipynb](./analysis/analysis-network_data_output_example.ipynb)
    - simple network data creation test - run with a few specs against test data, make sure I get the right size of output back for each.
    - even lower level, make tests for the method to build person dictionaries, and the base lookup method.
    - also apply a tag to a few article_subject and test for omitting tags? Same as omitting single names (or stacked?).
    - ? - make sure the Article_Data method `filter_article_persons()` works as I intend. To start, create tests in notebook against actual database, using full Article_Subject and Article_Author QuerySets, compare numbers to raw queries. Then, do the same against test database, use numbers to create unit tests. This should be covered by simple network creation tests (they call this method).
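
The plan above to store the length and hash of each expected output string could be sketched roughly like this. It is a minimal standalone illustration that uses only `hashlib`: the helper name `summarize_output()` and the sample strings are hypothetical, and it assumes strings are encoded as UTF-8 and hashed with SHA-256 (the default hash function passed to `make_string_hash()` later in this notebook).

```python
import hashlib


def summarize_output( output_string_IN ):

    '''
    Hypothetical helper: returns ( length, SHA-256 hex digest ) for a string.
    Assumes the string is encoded as UTF-8 before hashing.
    '''

    output_length = len( output_string_IN )
    output_hash = hashlib.sha256( output_string_IN.encode( "utf-8" ) ).hexdigest()
    return ( output_length, output_hash )

#-- END function summarize_output() --#


# made-up stand-ins for rendered network output.
expected_output = "person_id,1,2\n1,0,1\n2,1,0\n"
actual_output = "person_id,1,2\n1,0,1\n2,1,0\n"

# values that would be stored in the test case.
expected_length, expected_hash = summarize_output( expected_output )

# comparison a unit test would make against freshly generated output.
actual_length, actual_hash = summarize_output( actual_output )
lengths_match = ( actual_length == expected_length )
hashes_match = ( actual_hash == expected_hash )
print( "lengths match: {}; hashes match: {}".format( lengths_match, hashes_match ) )
```

Checking the length first gives a cheap, readable failure message; the hash then catches changes that leave the length the same.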

# Setup

- Back to [Table of Contents](#Table-of-Contents)

## Setup - Debug

- Back to [Table of Contents](#Table-of-Contents)

In [None]:
debug_flag = False

## Setup - Imports

- Back to [Table of Contents](#Table-of-Contents)

In [None]:
import datetime
from django.db.models import Avg, Max, Min
import hashlib
import json
import logging
import six

print( "packages imported at " + str( datetime.datetime.now() ) )

## Setup - working folder paths

- Back to [Table of Contents](#Table-of-Contents)

In [None]:
%pwd

In [None]:
# current working folder
current_working_folder = "/home/jonathanmorgan/work/django/research/research/work/phd_work/analysis"
current_datetime = datetime.datetime.now()
current_date_string = current_datetime.strftime( "%Y-%m-%d-%H-%M-%S" )

## Setup - logging

- Back to [Table of Contents](#Table-of-Contents)

Configure logging for this notebook's kernel. (If you do not run this cell, you'll get the Django application's logging configuration.)

In [None]:
# build file name
logging_file_name = "{}/article_coding-{}.log.txt".format( current_working_folder, current_date_string )

# set up logging.
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = logging_file_name,
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)

## Setup - virtualenv jupyter kernel

- Back to [Table of Contents](#Table-of-Contents)

If you are using a virtualenv, make sure that you:

- have installed your virtualenv as a kernel.
- choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to get it activated somehow inside this notebook.  One option is to run `../dev/wsgi.py` in this notebook, to configure the python environment manually as if you had activated the `sourcenet` virtualenv.  To do this, you'd make a code cell that contains:

    %run ../dev/wsgi.py
    
This is sketchy, however, because of the changes it makes to your Python environment within the context of whatever your current kernel is.  I'd worry about collisions with the actual Python 3 kernel.  Better, one can install their virtualenv as a separate kernel.  Steps:

- activate your virtualenv:

        workon research

- in your virtualenv, install the package `ipykernel`.

        pip install ipykernel

- use the ipykernel python program to install the current environment as a kernel:

        python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
        
    `sourcenet` example:
    
        python -m ipykernel install --user --name sourcenet --display-name "research (Python 3)"
        
More details: [http://ipython.readthedocs.io/en/stable/install/kernel_install.html](http://ipython.readthedocs.io/en/stable/install/kernel_install.html)
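
Once the kernel is installed and selected, a quick check like the following can confirm which Python environment the notebook is actually running in. This is only a sketch: it assumes an environment created by `venv` or a recent `virtualenv`, where `sys.prefix` differs from `sys.base_prefix`; older `virtualenv` releases signal this differently.

```python
import sys

# path of the Python interpreter running this kernel.
print( "interpreter: {}".format( sys.executable ) )

# root folder of the active environment.
print( "prefix: {}".format( sys.prefix ) )

# inside a venv (or recent virtualenv), sys.prefix differs from sys.base_prefix.
# getattr() guards against interpreters that do not define base_prefix.
in_virtual_env = ( sys.prefix != getattr( sys, "base_prefix", sys.prefix ) )
print( "in virtual environment: {}".format( in_virtual_env ) )
```

If the interpreter path does not point inside the virtualenv, the notebook is probably still using the default kernel.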

## Setup - Initialize Django

- Back to [Table of Contents](#Table-of-Contents)

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.

In [None]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [None]:
%run $django_init_path

### Setup - django-related imports

In [None]:
# python utilities
from python_utilities.strings.string_helper import StringHelper

# django imports
from django.contrib.auth.models import User
from django.db.models import Max
from django.db.models import Min

# sourcenet imports
from context_text.shared.context_text_base import ContextTextBase

# context_analysis imports
from context_analysis.network.network_person_info import NetworkPersonInfo

# sourcenet imports
from context_text.models import Article
from context_text.models import Article_Author
from context_text.models import Article_Data
from context_text.models import Article_Subject
from context_text.models import Newspaper
from context_text.models import Person

# article coding
from context_text.article_coding.article_coder import ArticleCoder
from context_text.article_coding.article_coding import ArticleCoding
from context_text.article_coding.open_calais_v2.open_calais_v2_article_coder import OpenCalaisV2ArticleCoder
from context_text.article_coding.open_calais_v2.open_calais_v2_api_response import OpenCalaisV2ApiResponse

# article data collection
from context_text.collectors.newsbank.newspapers.GRPB import GRPB
from context_text.collectors.newsbank.newspapers.DTNB import DTNB

# import class that actually processes requests for outputting networks.
from context_text.export.network_output import NetworkOutput

# context_text shared
from context_text.shared.context_text_base import ContextTextBase

print( "django model packages imported at " + str( datetime.datetime.now() ) )

## Setup - Initialize LoggingHelper

- Back to [Table of Contents](#Table-of-Contents)

Create a LoggingHelper instance to use to log debug and also print at the same time.

Preconditions: Must be run after Django is initialized, since `python_utilities` is in the django path.

In [None]:
# python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

# init
my_logging_helper = LoggingHelper()
my_logging_helper.set_logger_name( "newsbank-article_coding-unittest" )
log_message = None

## Setup - load fixtures and prepare database

- Back to [Table of Contents](#Table-of-Contents)

Detailed instructions: [https://github.com/jonathanmorgan/context_text#using-unittest-data-for-development](https://github.com/jonathanmorgan/context_text#using-unittest-data-for-development)

Create test database, load fixtures, etc.:

- create a database where the unit test data can live.  I usually call it the name of the main production database ("`research`") followed by "`_test`".  The easiest way to do this is to create the database, then give the user you use for your production database the same access to the test database that they have in production.

    - postgresql example, where production database name is "`research`" and database user is "`django_user`":

            CREATE DATABASE research_test OWNER django_user;
            GRANT ALL PRIVILEGES ON DATABASE research_test TO django_user;

- update the DATABASES dictionary in settings.py of the application that contains context_text to point to your test database (in the easy example above, you could just change the 'NAME' attribute in the 'default' entry to "`research_test`" rather than "`research`").
- cd into your django application's home directory, activate your virtualenv if you created one, then run "`python manage.py migrate`" to create all the tables in the database.

        cd <django_app_directory>
        workon research
        python manage.py migrate

- use the command "`python manage.py createsuperuser`" to make an admin user, for logging into the django admins.

        python manage.py createsuperuser

- load the unit test fixtures into the database, including "export" data with `Article_Data` coded by OpenCalais v.2:

        python manage.py loaddata context_text_unittest_export_auth_data.json
        python manage.py loaddata context_text_unittest_django_config_data.json
        python manage.py loaddata context_text_unittest_export_data.json
        python manage.py loaddata context_text_unittest_export_taggit_data.json
        python manage.py loaddata context-sourcenet_entity_and_relation_types.json

- Then, you can set the OpenCalais v.2 Access Token `django_config` property (application = “OpenCalais_REST_API_v2”; property name = “open_calais_access_token”) to your OpenCalais Token value.  This should let OpenCalais work correctly on this database.
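
For the `settings.py` step above, the change to the `DATABASES` dictionary might look something like this. It is a sketch that assumes the PostgreSQL backend; the database name and user come from the example above, and the password, host, and port are placeholders to replace with your own values.

```python
# Sketch of the 'default' entry in settings.py, pointed at the test database.
# PASSWORD, HOST and PORT are placeholders, not values from this project.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'research_test',  # was 'research'
        'USER': 'django_user',
        'PASSWORD': '<password>',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
```

Remember to switch 'NAME' back to "`research`" when you are done working against the test data.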

## Setup - shared variables

In [None]:
# get ArticleCoding instance.
#article_coding = ArticleCoding()

# automated coding user
automated_coder = ArticleCoder.get_automated_coding_user()

# newspapers for Grand Rapids Press and Detroit News.
grand_rapids_press = Newspaper.objects.get( newsbank_code = "GRPB" )
detroit_news = Newspaper.objects.get( newsbank_code = "DTNB" )

# OpenCalais v2 coder type
ocv2_coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION

## Setup - functions

### Setup - function `make_string_hash()`

In [None]:
def make_string_hash( value_IN, hash_function_IN = hashlib.sha256 ):

    # return reference
    value_OUT = None

    # declare variables
    me = "make_string_hash"

    # call StringHelper method.
    value_OUT = StringHelper.make_string_hash( value_IN, hash_function_IN = hash_function_IN )

    return value_OUT

#-- END function make_string_hash() --#

print( "function make_string_hash() defined at " + str( datetime.datetime.now() ) )

# Examine articles and `Article_Data`

- Back to [Table of Contents](#Table-of-Contents)

Tag all locally implemented hard news articles in database and all that have already been coded using Open Calais V2, then work through using OpenCalais to code all local hard news that hasn't already been coded, starting with those proximal to the coding sample for methods paper.

## which articles have already been coded?

- Back to [Table of Contents](#Table-of-Contents)

More precisely, find all articles that have Article_Data coded by the automated coder with type "OpenCalais_REST_API_v2" and tag the articles as "coded-open_calais_v2" or something like that.

Then, for articles without that tag, use our criteria for local hard news to filter out and tag publications in the year before and after the month used to evaluate the automated coder, in both the Grand Rapids Press and the Detroit News, so I can look at longer time frames, then code all articles currently in database.

Eventually, then, we'll code and examine before and after layoffs.

In [None]:
# look for publications that have article data:
# - coded by automated coder
# - with coder type of "OpenCalais_REST_API_v2"

# get automated coder
automated_coder_user = ArticleCoder.get_automated_coding_user()

print( "{} - Loaded automated user: {}, id = {}".format( datetime.datetime.now(), automated_coder_user, automated_coder_user.id ) )

In [None]:
# try aggregates
article_qs = Article.objects.all()
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )

In [None]:
# find articles with Article_Data created by the automated user...
article_qs = Article.objects.filter( article_data__coder = automated_coder_user )

# ...and specifically coded using OpenCalais V2...
article_qs = article_qs.filter( article_data__coder_type = OpenCalaisV2ArticleCoder.CONFIG_APPLICATION )

# ...and finally, we just want the distinct articles by ID.
article_qs = article_qs.order_by( "id" ).distinct( "id" )

# count?
article_count = article_qs.count()
print( "Found {} articles".format( article_count ) )

### Tag the coded articles

- Back to [Table of Contents](#Table-of-Contents)

Removing duplicates present from joining with Article_Data yields 579 articles that were coded by the automated coder.

Tag all the coded articles with `OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME`.

In [None]:
# declare variables
current_article = None
tag_name_list = None
article_count = None
untagged_count = None
already_tagged_count = None
newly_tagged_count = None
count_sum = None
do_add_tag = False

# init
do_add_tag = False

# get article_count
article_count = article_qs.count()

# loop over articles.
untagged_count = 0
already_tagged_count = 0
newly_tagged_count = 0
for current_article in article_qs:
    
    # get list of tags for this publication
    tag_name_list = current_article.tags.names()
    
    # is the coded tag in the list?
    if ( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME not in tag_name_list ):
        
        # are we adding tag?
        if ( do_add_tag == True ):

            # add tag.
            current_article.tags.add( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
            newly_tagged_count += 1
            
        else:

            # for now, increment untagged count
            untagged_count += 1
            
        #-- END check to see if we are adding tag. --#
        
    else:
        
        # already tagged
        already_tagged_count += 1
        
    #-- END check to see if coded tag is set --#
    
#-- END loop over articles. --#

print( "Article counts:" )
print( "- total articles: {}".format( article_count ) )
print( "- untagged articles: {}".format( untagged_count ) )
print( "- already tagged: {}".format( already_tagged_count ) )
print( "- newly tagged: {}".format( newly_tagged_count ) )
count_sum = untagged_count + already_tagged_count + newly_tagged_count
print( "- count sum: {}".format( count_sum ) )

### Profile the coded articles

- Back to [Table of Contents](#Table-of-Contents)

Look at range of pub dates.

In [None]:
tags_in_list = []
tags_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )
article_qs = Article.objects.filter( tags__name__in = tags_in_list )
print( "Matching article count: {}".format( article_qs.count() ) )

- Original: 579
- after coding 10: 589 (tag is being set correctly by Open Calais V2 coder)
- 2019.08.02 - after 5000 (minus a few errors because 2 seconds isn't quite enough for rate limit): 5518

In [None]:
# profile these publications
min_pub_date = None
max_pub_date = None
current_pub_date = None
pub_date_count = None
date_to_count_map = {}
date_to_articles_map = {}
pub_date_article_dict = None

# try aggregates
pub_date_info = article_qs.aggregate( Max( 'pub_date' ), Min( 'pub_date' ) )
print( pub_date_info )

# counts of pubs by date
for current_article in article_qs:
    
    # get pub_date
    current_pub_date = current_article.pub_date
    current_article_id = current_article.id
    
    # get count, increment, and store.
    pub_date_count = date_to_count_map.get( current_pub_date, 0 )
    pub_date_count += 1
    date_to_count_map[ current_pub_date ] = pub_date_count
    
    # also, store up ids and instances
    
    # get dict of article ids to article instances for date
    pub_date_article_dict = date_to_articles_map.get( current_pub_date, {} )
    
    # article already there?
    if ( current_article_id not in pub_date_article_dict ):
        
        # no - add it.
        pub_date_article_dict[ current_article_id ] = current_article
        
    #-- END check to see if article already there.
    
    # put dict back.
    date_to_articles_map[ current_pub_date ] = pub_date_article_dict
    
#-- END loop over articles. --#

# output dates and counts.

# get list of keys from map
keys_list = list( six.viewkeys( date_to_count_map ) )
keys_list.sort()
for current_pub_date in keys_list:
    
    # get count
    pub_date_count = date_to_count_map.get( current_pub_date, 0 )
    print( "- {} ( {} ) count: {}".format( current_pub_date, type( current_pub_date ), pub_date_count ) )
    
#-- END loop over dates --#
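
The cell above builds two maps by hand: publication date to article count, and publication date to a dictionary of article ID to article. As a possible alternative, the same bookkeeping can be sketched with `collections.Counter` and `collections.defaultdict`. The `( ID, date )` tuples below are made-up stand-ins for `Article` instances, and the variable names are chosen to avoid clobbering the notebook's own maps.

```python
import collections
import datetime

# hypothetical ( article ID, pub_date ) pairs standing in for Article instances.
article_list = [
    ( 101, datetime.date( 2009, 12, 10 ) ),
    ( 102, datetime.date( 2009, 12, 10 ) ),
    ( 103, datetime.date( 2009, 12, 11 ) ),
]

# Counter starts every key at 0; defaultdict( dict ) starts every key at {}.
count_by_date = collections.Counter()
articles_by_date = collections.defaultdict( dict )

for article_id, pub_date in article_list:

    # count articles per date, and group articles by date, keyed by ID.
    count_by_date[ pub_date ] += 1
    articles_by_date[ pub_date ][ article_id ] = ( article_id, pub_date )

#-- END loop over articles. --#

# output dates and counts, in date order.
for current_pub_date in sorted( count_by_date ):

    print( "- {} count: {}".format( current_pub_date, count_by_date[ current_pub_date ] ) )

#-- END loop over dates --#
```

One caveat: indexing a `defaultdict` with a missing key creates that key, so a lookup cell should keep using `.get()` as the notebook does.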

In [None]:
# look at articles for a particular date
focus_date = "2009-12-10"
pub_date = datetime.datetime.strptime( focus_date, "%Y-%m-%d" ).date()
articles_for_date = date_to_articles_map.get( pub_date, {} )

# get each article
for article_id, article_instance in articles_for_date.items():
    
    # look at its tags.
    print( "\n==> Article {article_id}: {article_summary}".format( article_id = article_id, article_summary = article_instance ) )
    print( "- tags: {}".format( article_instance.tags.all() ) )

    # loop over associated Article_Data instances.
    for article_data in article_instance.article_data_set.all():

        print( "----> article_data: {}".format( article_data ) )

    #-- END loop over associated Article_Data instances --#

#-- END loop over articles for date.

## tag all local news

- Back to [Table of Contents](#Table-of-Contents)

Definitions of local hard news by in-house implementor for Grand Rapids Press and Detroit News follow.  For each, tag all articles in database that match as "local_hard_news".

### TODO

- Back to [Table of Contents](#Table-of-Contents)

TODO:

- make class for GRPB at NewsBank.

    - also, pull the things that are newspaper specific out of ArticleCoder.py and into the GRPB.py class.

- refine "local news" and "locally created" regular expressions for Grand Rapids Press based on contents of `author_string` and `author_affiliation`.
- do the same for TDN.
- then, use the updated classes and definitions below to flag all local hard news in database for each publication.

#### DONE

- Back to [Table of Contents](#Table-of-Contents)

DONE:

- abstract out shared stuff from GRPB.py and DTNB.py into abstract parent class context_text/collectors/newsbank/newspapers/newsbank_newspaper.py

    - update DTNB.py to use the parent class.
    
- make class for GRPB at NewsBank.

    - context_text/collectors/newsbank/newspapers/GRPB.py

### Grand Rapids Press local news

- Back to [Table of Contents](#Table-of-Contents)

Grand Rapids Press local hard news:

- `context_text/examples/articles/articles-GRP-local_news.py`
- local hard news sections (stored in `Article.GRP_NEWS_SECTION_NAME_LIST`):

    - "Business"
    - "City and Region"
    - "Front Page"
    - "Lakeshore"
    - "Religion"
    - "Special"
    - "State"

- in-house implementor (based on byline patterns, stored in `sourcenet.models.Article.Q_GRP_IN_HOUSE_AUTHOR`):

    - Byline ends in "/ THE GRAND RAPIDS PRESS", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *THE GRAND RAPIDS PRESS$' )`

    - Byline ends in "/ PRESS * EDITOR", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *PRESS .* EDITOR$' )`

    - Byline ends in "/ GRAND RAPIDS PRESS * BUREAU", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *GRAND RAPIDS PRESS .* BUREAU$' )`

    - Byline ends in "/ SPECIAL TO THE PRESS", ignore case.

        - `Q( author_varchar__iregex = r'.* */ *SPECIAL TO THE PRESS$' )`
        
- can also exclude columns (I will not):

        grp_article_qs = grp_article_qs.exclude( index_terms__icontains = "Column" )

Need to work to further refine this.
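
To try the byline patterns listed above outside the database, something like the following sketch can help. It uses Python's `re` module with `re.IGNORECASE` as a stand-in for the `iregex` lookup, and assumes the four patterns are OR-ed together, so a byline counts as in-house if any one matches. The function name `is_grp_in_house()` and the sample bylines are made up for illustration. The database's regular expression engine can differ from Python's in details, so treat this as a quick check rather than a guarantee.

```python
import re

# patterns copied from the byline rules above.
grp_in_house_patterns = [
    r'.* */ *THE GRAND RAPIDS PRESS$',
    r'.* */ *PRESS .* EDITOR$',
    r'.* */ *GRAND RAPIDS PRESS .* BUREAU$',
    r'.* */ *SPECIAL TO THE PRESS$',
]


def is_grp_in_house( byline_IN ):

    '''
    Hypothetical helper: True if the byline matches any in-house pattern,
    ignoring case.
    '''

    # return reference
    is_match_OUT = False

    for current_pattern in grp_in_house_patterns:

        if ( re.search( current_pattern, byline_IN, re.IGNORECASE ) is not None ):

            is_match_OUT = True
            break

        #-- END check to see if pattern matches. --#

    #-- END loop over patterns. --#

    return is_match_OUT

#-- END function is_grp_in_house() --#


# made-up bylines to exercise each rule, plus one that should not match.
sample_byline_list = [
    "By Jane Doe / The Grand Rapids Press",
    "By John Roe / Press Business Editor",
    "By Ann Poe / Grand Rapids Press Lansing Bureau",
    "By Sam Low / Special to The Press",
    "The Associated Press",
]
for current_byline in sample_byline_list:

    print( "- {} : {}".format( current_byline, is_grp_in_house( current_byline ) ) )

#-- END loop over sample bylines. --#
```

Every pattern requires a "/" before the affiliation, so collective bylines without one (like the last sample) are not treated as in-house.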

Looking at affiliation strings:

    SELECT author_affiliation, COUNT( author_affiliation ) as affiliation_count
    FROM context_text_article
    WHERE newspaper_id = 1
    GROUP BY author_affiliation
    ORDER BY COUNT( author_affiliation ) DESC;
    
And at author strings for collective bylines:

    SELECT author_string, COUNT( author_string ) as author_count
    FROM context_text_article
    WHERE newspaper_id = 1
    GROUP BY author_string
    ORDER BY COUNT( author_string ) DESC
    LIMIT 10;


In [None]:
# filter queryset to just locally created Grand Rapids Press (GRP) articles.
# imports
from context_text.models import Article
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
from context_text.collectors.newsbank.newspapers.GRPB import GRPB

# declare variables - Grand Rapids Press
do_apply_tag = False
tag_to_apply = None
grp_local_news_sections = []
grp_newspaper = None
grp_article_qs = None
article_count = -1

# declare variables - filtering
include_opinion_columns = True
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1

# declare variables - make list of article IDs from QS.
article_id_list = []
article_counter = -1
current_article = None
article_tag_name_list = None
article_update_counter = -1

# ==> configure

# configure - size of random sample we want
#random_count = 60

# configure - also, apply tag?
do_apply_tag = False
tag_to_apply = ContextTextBase.TAG_LOCAL_HARD_NEWS

# set up "local, regional and state news" sections
grp_local_news_sections = GRPB.LOCAL_NEWS_SECTION_NAME_LIST

# Grand Rapids Press
# get newspaper instance for GRP.
grp_newspaper = Newspaper.objects.get( id = GRPB.NEWSPAPER_ID )

# start with all articles
#grp_article_qs = Article.objects.all()

# ==> filter to newspaper, local news section list, and in-house reporters.

# ----> manually

# now, need to find local news articles to test on.
#grp_article_qs = grp_article_qs.filter( newspaper = grp_newspaper )

# only the locally implemented sections
#grp_article_qs = grp_article_qs.filter( section__in = grp_local_news_sections )

# and, with an in-house author
#grp_article_qs = grp_article_qs.filter( Article.Q_GRP_IN_HOUSE_AUTHOR )

#print( "manual filter count: {}".format( grp_article_qs.count() ) )

# ----> using Article.filter_articles()
grp_article_qs = Article.filter_articles( qs_IN = grp_article_qs,
                                          newspaper = grp_newspaper,
                                          section_name_list = grp_local_news_sections,
                                          custom_article_q = GRPB.Q_IN_HOUSE_AUTHOR )

print( "Article.filter_articles count: {}".format( grp_article_qs.count() ) )

# and include opinion columns?
if ( include_opinion_columns == False ):
    
    # do not include columns
    grp_article_qs = grp_article_qs.exclude( index_terms__icontains = "Column" )
    
#-- END check to see if we include columns. --#

'''
# filter to newspaper, section list, and in-house reporters.
grp_article_qs = Article.filter_articles( qs_IN = grp_article_qs,
                                          start_date = "2009-12-01",
                                          end_date = "2009-12-31",
                                          newspaper = grp_newspaper,
                                          section_name_list = grp_local_news_sections,
                                          custom_article_q = Article.Q_GRP_IN_HOUSE_AUTHOR )
'''

# how many is that?
article_count = grp_article_qs.count()

print( "Article count before filtering on tags: " + str( article_count ) )

# ==> tags

# tags to exclude
tags_not_in_list = []

# Example: prelim-related tags
#tags_not_in_list.append( "prelim_reliability" )
#tags_not_in_list.append( "prelim_network" )
#tags_not_in_list.append( "minnesota1-20160328" )
#tags_not_in_list.append( "minnesota2-20160328" )

# for later - exclude articles already coded.
#tags_not_in_list.append( OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME )

# exclude any already tagged with tag_to_apply
tags_not_in_list.append( tag_to_apply )

if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "filtering out articles with tags: " + str( tags_not_in_list ) )
    grp_article_qs = grp_article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# include only those with certain tags.
tags_in_list = []

# Examples

# Examples: prelim-related tags
#tags_in_list.append( "prelim_unit_test_001" )
#tags_in_list.append( "prelim_unit_test_002" )
#tags_in_list.append( "prelim_unit_test_003" )
#tags_in_list.append( "prelim_unit_test_004" )
#tags_in_list.append( "prelim_unit_test_005" )
#tags_in_list.append( "prelim_unit_test_006" )
#tags_in_list.append( "prelim_unit_test_007" )

# Example: grp_month
#tags_in_list.append( "grp_month" )

if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    grp_article_qs = grp_article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# filter out "*prelim*" tags?
#filter_out_prelim_tags = True
if ( filter_out_prelim_tags == True ):

    # filter out all articles with any tag whose name contains "prelim".
    print( "filtering out articles with tags that contain \"prelim\"" )
    grp_article_qs = grp_article_qs.exclude( tags__name__icontains = "prelim" )
    
#-- END check to see if we filter out "prelim_*" tags --#

# how many is that?
article_count = grp_article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

# do we want a random sample?
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    grp_article_qs = grp_article_qs.order_by( "?" )[ : random_count ]
    
#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

# make ID list, tag articles if configured to.
article_id_list = []
article_counter = 0
article_update_counter = 0
for current_article in grp_article_qs:

    # increment article_counter
    article_counter += 1

    # add IDs to article_id_list
    article_id_list.append( str( current_article.id ) )
    
    # apply a tag while we are at it?
    if ( ( do_apply_tag == True ) and ( tag_to_apply is not None ) and ( tag_to_apply != "" ) ):
    
        # yes, please.  Tag already present?
        article_tag_name_list = current_article.tags.names()
        if ( tag_to_apply not in article_tag_name_list ):

            # Add tag.
            current_article.tags.add( tag_to_apply )
            
            # increment counter
            article_update_counter += 1
            
        #-- END check to see if tag already present. --#
        
    #-- END check to see if we apply tag. --#

    # output the tags.
    if ( debug_flag == True ):
        print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
    #-- END DEBUG --#

#-- END loop over articles --#

# output the list.
print( "grp_article_qs count: {}".format( grp_article_qs.count() ) )
print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
print( "- Updated {} articles to add tag {}.".format( article_update_counter, tag_to_apply ) )
if ( debug_flag == True ):
    print( "List of " + str( len( article_id_list ) ) + " local GRP staff article IDs: " + ", ".join( article_id_list ) )
#-- END DEBUG --#
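
The random-sample step in the cell above uses `order_by( "?" )`, which asks the database to sort the whole result set randomly; the article linked in that cell's comments argues against this for large tables. One possible alternative, sketched here under assumptions, is to pull just the IDs and sample them in Python with `random.sample()`. The ID list below is a stand-in; in the notebook it might come from something like `list( grp_article_qs.values_list( "id", flat = True ) )`, with the sample then fed back through an `id__in` filter.

```python
import random

# hypothetical article IDs standing in for IDs pulled from the QuerySet.
all_article_id_list = list( range( 1, 501 ) )

# size of random sample we want.
random_count = 60

# sample without replacement, never asking for more IDs than exist.
sample_size = min( random_count, len( all_article_id_list ) )
sampled_id_list = random.sample( all_article_id_list, sample_size )

print( "sampled {} of {} article IDs".format( len( sampled_id_list ), len( all_article_id_list ) ) )
```

This trades one potentially expensive sort for loading every ID into memory, which should be fine at the scale of this database but is a judgment call.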


### Detroit News local news

- Back to [Table of Contents](#Table-of-Contents)

Detroit News local news:

- `context_text/examples/articles/articles-TDN-local_news.py`
- local hard news sections (stored in `DTNB.NEWS_SECTION_NAME_LIST`, where `DTNB` is imported with `from context_text.collectors.newsbank.newspapers.DTNB import DTNB`):

    - "Business"
    - "Metro"
    - "Nation" - because of auto industry stories

- in-house implementor (based on byline patterns, stored in `DTNB.Q_IN_HOUSE_AUTHOR`):

    - Byline ends in "/ The Detroit News", ignore case.

        - `Q( author_varchar__iregex = r'.*\s*/\s*the\s*detroit\s*news$' )`

    - Byline ends in "/ Special to The Detroit News", ignore case.

        - `Q( author_varchar__iregex = r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$' )`

    - Byline ends in "/ Detroit News * Bureau", ignore case.

        - `Q( author_varchar__iregex = r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$' )`   
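
The Detroit News patterns use `\s*` between words, so they tolerate missing or extra whitespace, and each requires a "/" before the affiliation. The sketch below shows both properties using Python's `re` module as a stand-in for the `iregex` lookup. The rule labels, the function name `find_matching_rules()` and the sample bylines are made up for illustration, and the database's regular expression engine may differ from Python's in details.

```python
import re

# patterns copied from the byline rules above, keyed by a made-up label.
tdn_pattern_map = {
    "staff": r'.*\s*/\s*the\s*detroit\s*news$',
    "special": r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$',
    "bureau": r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$',
}


def find_matching_rules( byline_IN ):

    '''
    Hypothetical helper: returns list of labels of the rules that the byline
    matches, ignoring case.
    '''

    # return reference
    rule_list_OUT = []

    for rule_label, rule_pattern in tdn_pattern_map.items():

        if ( re.search( rule_pattern, byline_IN, re.IGNORECASE ) is not None ):

            rule_list_OUT.append( rule_label )

        #-- END check to see if pattern matches. --#

    #-- END loop over rules. --#

    return rule_list_OUT

#-- END function find_matching_rules() --#


# made-up bylines: one per rule, one with no whitespace, one with no "/".
sample_byline_list = [
    "Jane Doe / The Detroit News",
    "Jane Doe/TheDetroitNews",
    "Jane Doe / Special to The Detroit News",
    "Jane Doe / Detroit News Lansing Bureau",
    "Special to The Detroit News",
]
for current_byline in sample_byline_list:

    print( "- {} : {}".format( current_byline, find_matching_rules( current_byline ) ) )

#-- END loop over sample bylines. --#
```

The last sample matches nothing because it has no "/", which is worth keeping in mind when refining these rules against real `author_string` values.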

In [None]:
# filter queryset to just locally created Detroit News (TDN) articles.
# imports
from context_text.models import Article
from context_text.models import Newspaper
from context_text.shared.context_text_base import ContextTextBase
from context_text.collectors.newsbank.newspapers.DTNB import DTNB

# declare variables - Detroit News
do_apply_tag = False
tag_to_apply = None
tdn_local_news_sections = []
tdn_newspaper = None
tdn_article_qs = None
article_count = -1

# declare variables - filtering
include_opinion_columns = True
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1

# declare variables - make list of article IDs from QS.
article_id_list = []
article_counter = -1
current_article = None

# ==> configure

# configure - size of random sample we want
#random_count = 60

# configure - also, apply tag?
do_apply_tag = False
tag_to_apply = ContextTextBase.TAG_LOCAL_HARD_NEWS

# set up "local, regional and state news" sections
tdn_local_news_sections = DTNB.LOCAL_NEWS_SECTION_NAME_LIST

# Detroit News
# get newspaper instance for TDN.
tdn_newspaper = Newspaper.objects.get( id = DTNB.NEWSPAPER_ID )

# start with all articles
#tdn_article_qs = Article.objects.all()

# ==> filter to newspaper, local news section list, and in-house reporters.

# ----> manually

# now, need to find local news articles to test on.
#tdn_article_qs = tdn_article_qs.filter( newspaper = tdn_newspaper )

# only the locally implemented sections
#tdn_article_qs = tdn_article_qs.filter( section__in = tdn_local_news_sections )

# and, with an in-house author
#tdn_article_qs = tdn_article_qs.filter( DTNB.Q_IN_HOUSE_AUTHOR )

#print( "manual filter count: {}".format( tdn_article_qs.count() ) )

# ----> using Article.filter_articles()
tdn_article_qs = Article.filter_articles( qs_IN = tdn_article_qs,
                                          newspaper = tdn_newspaper,
                                          section_name_list = tdn_local_news_sections,
                                          custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR )

print( "Article.filter_articles count: {}".format( tdn_article_qs.count() ) )

# and include opinion columns?
if ( include_opinion_columns == False ):
    
    # do not include columns
    tdn_article_qs = tdn_article_qs.exclude( author_string__in = DTNB.COLUMNIST_NAME_LIST )
    
#-- END check to see if we include columns. --#

'''
# filter to newspaper, section list, and in-house reporters.
tdn_article_qs = Article.filter_articles( qs_IN = tdn_article_qs,
                                          start_date = "2009-12-01",
                                          end_date = "2009-12-31",
                                          newspaper = tdn_newspaper,
                                          section_name_list = tdn_local_news_sections,
                                          custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR )
'''

# how many is that?
article_count = tdn_article_qs.count()

print( "Article count before filtering on tags: " + str( article_count ) )

# ==> tags

# tags to exclude
#tags_not_in_list = [ "prelim_reliability", "prelim_network" ]
#tags_not_in_list = [ "minnesota1-20160328", "minnesota2-20160328", ]

# for later - exclude articles already coded.
#tags_not_in_list = [ OpenCalaisV2ArticleCoder.TAG_CODED_BY_ME ]

tags_not_in_list = None
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "filtering out articles with tags: " + str( tags_not_in_list ) )
    tdn_article_qs = tdn_article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# include only those with certain tags.
#tags_in_list = [ "prelim_unit_test_001", "prelim_unit_test_002", "prelim_unit_test_003", "prelim_unit_test_004", "prelim_unit_test_005", "prelim_unit_test_006", "prelim_unit_test_007" ]
#tags_in_list = [ "tdn_month", ]
tags_in_list = None
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    tdn_article_qs = tdn_article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# filter out "*prelim*" tags?
#filter_out_prelim_tags = True
if ( filter_out_prelim_tags == True ):

    # filter out all articles with any tag whose name contains "prelim".
    print( "filtering out articles with tags that contain \"prelim\"" )
    tdn_article_qs = tdn_article_qs.exclude( tags__name__icontains = "prelim" )
    
#-- END check to see if we filter out "prelim_*" tags --#

# how many is that?
article_count = tdn_article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

# do we want a random sample?
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    tdn_article_qs = tdn_article_qs.order_by( "?" )[ : random_count ]
    
#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

# make ID list, tag articles if configured to.
article_id_list = []
article_counter = 0
for current_article in tdn_article_qs:

    # increment article_counter
    article_counter += 1

    # add IDs to article_id_list
    article_id_list.append( str( current_article.id ) )
    
    # apply a tag while we are at it?
    if ( ( do_apply_tag == True ) and ( tag_to_apply is not None ) and ( tag_to_apply != "" ) ):
    
        # yes, please.  Add tag.
        current_article.tags.add( tag_to_apply )
        
    #-- END check to see if we apply tag. --#

    # output the tags.
    if ( debug_flag == True ):
        print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
    #-- END DEBUG --#

#-- END loop over articles --#

# output the list.
print( "tdn_article_qs count: {}".format( tdn_article_qs.count() ) )
print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
if ( debug_flag == True ):
    print( "List of " + str( len( article_id_list ) ) + " local TDN staff article IDs: " + ", ".join( article_id_list ) )
#-- END DEBUG --#
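
The cell above draws its random sample with `order_by( "?" )`, which asks the database to shuffle every matching row before slicing. A minimal sketch of an alternative, assuming the list of matching IDs fits comfortably in memory: fetch only the IDs and sample them in Python. `sample_ids` is a hypothetical helper for illustration, not part of `context_text`.

```python
import random


def sample_ids( id_list_IN, sample_size_IN, seed_IN = None ):

    # hypothetical helper: pick up to sample_size_IN IDs at random,
    #     without replacement.  A seed makes the draw repeatable.
    rng = random.Random( seed_IN )
    id_list = list( id_list_IN )

    # never ask for more items than exist, or for a negative count.
    sample_size = max( 0, min( sample_size_IN, len( id_list ) ) )

    return rng.sample( id_list, sample_size )

#-- END function sample_ids() --#
```

With a Django QuerySet, the IDs could come from `tdn_article_qs.values_list( "id", flat = True )`, and the sampled list could then be passed back through `filter( id__in = ... )`.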


# Update data and write unit tests

## Add tags to `Article_Subject` instances

Add the following tags to the following `Article_Subject` instances:


### tag `from_press_release`

- Tag `from_press_release` added to the following `Article_Subject` instances:

    - 637 - Mark Meadows (person 224)
    - 677 - Gary Nelund (person 261)
    - 740 - Jennifer Granholm (person 102)


In [None]:
# add tag `from_press_release` to:
# - 637 - Mark Meadows (person 224)
# - 677 - Gary Nelund (person 261)
# - 740 - Jennifer Granholm (person 102)

# declare variables
tag_to_assign = None
article_subject_id_list = None
article_subject_id = None
article_subject = None

# what tag?
tag_to_assign = "from_press_release"

# build list of Article_Subject instances to process
article_subject_id_list = list()
article_subject_id_list.append( 637 )
article_subject_id_list.append( 677 )
article_subject_id_list.append( 740 )

# loop over Article_Subject instances
for article_subject_id in article_subject_id_list:
    
    # retrieve instance
    article_subject = Article_Subject.objects.get( id = article_subject_id )
    
    # add tag.
    article_subject.tags.add( tag_to_assign )
    
#-- END loop over Article_Subject instances. --#

status_message = "tag {tag_to_assign} added to Article_Subject IDs: {article_subject_id_list} at {timestamp_now}".format(
    tag_to_assign = tag_to_assign,
    article_subject_id_list = article_subject_id_list,
    timestamp_now = datetime.datetime.now()
)
print( status_message )


### tag `godwin_heights`

- Tag `godwin_heights` added to the following `Article_Subject` instances:

    - 621 - Hornecker, Kenneth (person 189)
    - 622 - Johnston, Allen E. (person 187)
    - 623 - Felske, Jon (person 188)


In [None]:
# add tag `godwin_heights` to:
# - 621 - Hornecker, Kenneth (person 189)
# - 622 - Johnston, Allen E. (person 187)
# - 623 - Felske, Jon (person 188)

# declare variables
tag_to_assign = None
article_subject_id_list = None
article_subject_id = None
article_subject = None

# what tag?
tag_to_assign = "godwin_heights"

# build list of Article_Subject instances to process
article_subject_id_list = list()
article_subject_id_list.append( 621 )
article_subject_id_list.append( 622 )
article_subject_id_list.append( 623 )

# loop over Article_Subject instances
for article_subject_id in article_subject_id_list:
    
    # retrieve instance
    article_subject = Article_Subject.objects.get( id = article_subject_id )
    
    # add tag.
    article_subject.tags.add( tag_to_assign )
    
#-- END loop over Article_Subject instances. --#

status_message = "tag {tag_to_assign} added to Article_Subject IDs: {article_subject_id_list} at {timestamp_now}".format(
    tag_to_assign = tag_to_assign,
    article_subject_id_list = article_subject_id_list,
    timestamp_now = datetime.datetime.now()
)
print( status_message )
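
The two tagging cells above differ only in the tag and the ID list, so the loop could be factored into one function. A minimal sketch, assuming each retrieved instance exposes a `tags.add()` method as `Article_Subject` does in the cells above; `add_tag_to_instances` and its `get_instance_IN` callable are hypothetical names introduced here.

```python
def add_tag_to_instances( get_instance_IN, id_list_IN, tag_IN ):

    # hypothetical helper: look up each ID with get_instance_IN, add the
    #     tag, and return the list of IDs that were tagged.
    tagged_id_list = []

    for instance_id in id_list_IN:

        # retrieve instance, then add tag.
        instance = get_instance_IN( instance_id )
        instance.tags.add( tag_IN )
        tagged_id_list.append( instance_id )

    #-- END loop over IDs --#

    return tagged_id_list

#-- END function add_tag_to_instances() --#
```

In this notebook the lookup would be something like `lambda pk: Article_Subject.objects.get( id = pk )`, called once with `[ 637, 677, 740 ]` and `"from_press_release"` and once with `[ 621, 622, 623 ]` and `"godwin_heights"`.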


# Unit tests

To create the needed data, run each of the following network data output specs twice: once with `network_include_render_details` set to "yes" and once with it set to "no".

- _Note: only pass True to `network_outputter.process_network_output_request( debug_flag_IN )` if you really need to debug - it adds garbage data at the end of the output, even if you ask for no render details._
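
The "run each spec twice" step can be sketched as a small loop. This is an illustration under assumptions, not the notebook's own code: `run_with_and_without_details` and the `process_IN` callable are hypothetical, and only the `network_include_render_details` key and its "yes"/"no" values come from the specs in this notebook.

```python
import copy


def run_with_and_without_details( request_IN, process_IN ):

    # hypothetical helper: run one spec with render details "yes" and
    #     then "no", returning both outputs keyed by that choice.
    output_dict = {}

    for choice in ( "yes", "no" ):

        # copy so the caller's request dictionary is never modified.
        request = copy.deepcopy( request_IN )
        request[ "network_include_render_details" ] = choice
        output_dict[ choice ] = process_IN( request )

    #-- END loop over choices --#

    return output_dict

#-- END function run_with_and_without_details() --#
```

Here `process_IN` would wrap a fresh `NetworkOutput()` and its `process_network_output_request( params_IN = ..., debug_flag_IN = False )` call, matching the note above about leaving the debug flag off.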

## create network data from "export" unit test data - GRP, all names

- See [`context_text` github README](https://github.com/jonathanmorgan/context_text#test-data) for more details on loading this data.

In [None]:
#include_single_word_names = "yes"
include_single_word_names = "yes"

request_json_string = """{
  "coders": "7",
  "end_date": "2010-02-13",
  "tags_list": "",
  "date_range": "",
  "start_date": "2009-12-07",
  "output_type": "tab_delimited_matrix",
  "publications": "1",
  "network_label": "all_names",
  "person_coders": "7",
  "database_output": "yes",
  "person_end_date": "2010-02-13",
  "person_tag_list": "",
  "coder_types_list": "OpenCalais_REST_API_v2",
  "person_date_range": "",
  "person_query_type": "custom",
  "person_start_date": "2009-12-07",
  "unique_identifiers": "",
  "person_publications": "1",
  "coder_id_priority_list": "",
  "coder_type_filter_type": "automated",
  "network_include_headers": "no",
  "person_coder_types_list": "OpenCalais_REST_API_v2",
  "allow_duplicate_articles": "no",
  "network_data_output_type": "net_and_attr_cols",
  "network_download_as_file": "no",
  "person_unique_identifiers": "",
  "include_source_contact_types": [
    "direct",
    "event",
    "past_quotes",
    "document",
    "other"
  ],
  "person_coder_id_priority_list": "",
  "person_coder_type_filter_type": "automated",
  "network_include_render_details": "no",
  "person_allow_duplicate_articles": "no",
  "exclude_persons_with_tags_in_list": "",
  "include_persons_with_single_word_name": "yes"
}"""
request_json = json.loads( request_json_string )
print( request_json ) 

In [None]:
# Create network data with render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_YES
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# Create network data without render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_NO
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "f879560cab27653185bb4e42baec40b6a5d685b4143388e55041399acb921c5f"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#
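
`make_string_hash` is defined outside this section, so its implementation is not shown here. The expected values are 64 hexadecimal characters long, which is consistent with a SHA-256 hex digest, but that is an assumption. A minimal sketch of such a hash, with the hypothetical name `sha256_hex` and an assumed UTF-8 encoding:

```python
import hashlib


def sha256_hex( value_IN, encoding_IN = "utf-8" ):

    # assumption: hash the encoded string with SHA-256 and return the
    #     64-character hexadecimal digest.
    return hashlib.sha256( value_IN.encode( encoding_IN ) ).hexdigest()

#-- END function sha256_hex() --#
```

If the real helper uses a different algorithm or encoding, the digests will not match the `should_be` values in this notebook.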

In [None]:
network_data_length = len( network_data )
should_be = 14121
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 74
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# persons 1049, 752 should be present.
find_person_id = 1049
if ( find_person_id in master_person_dict ):
    
    print( "SUCCESS - single-name person {} is in dictionary".format( find_person_id ) )
    
else:
    
    print( "ERROR - single-name person {} not in dictionary".format( find_person_id ) )
    
#-- END check for person 1049 --#

find_person_id = 752
if ( find_person_id in master_person_dict ):
    
    print( "SUCCESS - single-name person {} is in dictionary".format( find_person_id ) )
    
else:
    
    print( "ERROR - single-name person {} not in dictionary".format( find_person_id ) )
    
#-- END check for person 752 --#

# output all persons.
for person_id, person_instance in master_person_dict.items():
    
    print( "\n==> Person {person_id}: {person_instance}".format( person_id = person_id, person_instance = person_instance ) )
    
#-- END loop over persons --#
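
The presence checks above repeat the same test for each person ID. A minimal sketch of a helper that covers both the "should be present" and "should not be present" cases and returns a list of problems, so an empty list means success. `check_person_ids` is a hypothetical name, and it assumes only that the person dictionary is keyed by person ID, as in the cell above.

```python
def check_person_ids( person_dict_IN, present_IN = (), absent_IN = () ):

    # hypothetical helper: collect one message per ID that is missing
    #     when it should be present, or present when it should be absent.
    error_list = []

    for person_id in present_IN:

        if person_id not in person_dict_IN:
            error_list.append( "person {} not in dictionary".format( person_id ) )
        #-- END check for expected person --#

    #-- END loop over IDs that should be present --#

    for person_id in absent_IN:

        if person_id in person_dict_IN:
            error_list.append( "person {} is in dictionary".format( person_id ) )
        #-- END check for unexpected person --#

    #-- END loop over IDs that should be absent --#

    return error_list

#-- END function check_person_ids() --#
```

For the all-names spec this would be called with `present_IN = [ 1049, 752 ]`.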

## network data from "export" unit test data - no single names

- See [`context_text` github README](https://github.com/jonathanmorgan/context_text#test-data) for more details on loading this data.

In [None]:
#include_single_word_names = "yes"
include_single_word_names = "no"

request_json_string = """{
  "coders": "7",
  "end_date": "2010-02-13",
  "tags_list": "",
  "date_range": "",
  "start_date": "2009-12-07",
  "output_type": "tab_delimited_matrix",
  "publications": "1",
  "network_label": "no_single_names",
  "person_coders": "7",
  "database_output": "yes",
  "person_end_date": "2010-02-13",
  "person_tag_list": "",
  "coder_types_list": "OpenCalais_REST_API_v2",
  "person_date_range": "",
  "person_query_type": "custom",
  "person_start_date": "2009-12-07",
  "unique_identifiers": "",
  "person_publications": "1",
  "coder_id_priority_list": "",
  "coder_type_filter_type": "automated",
  "network_include_headers": "no",
  "person_coder_types_list": "OpenCalais_REST_API_v2",
  "allow_duplicate_articles": "no",
  "network_data_output_type": "net_and_attr_cols",
  "network_download_as_file": "no",
  "person_unique_identifiers": "",
  "include_source_contact_types": "direct,event,past_quotes,document,other",
  "person_coder_id_priority_list": "",
  "person_coder_type_filter_type": "automated",
  "network_include_render_details": "no",
  "person_allow_duplicate_articles": "no",
  "exclude_persons_with_tags_in_list": "",
  "include_persons_with_single_word_name": "no"
}"""
request_json = json.loads( request_json_string )
print( request_json ) 

In [None]:
# Create network data with render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_YES
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# Create network data without render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_NO
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "f85a48630c029f848bbb815d003b188eff38346b8eac0da2d55b7b224b323ac5"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 13448
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 72
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# persons 1049, 752 should not be present.
find_person_list = list()
find_person_list.append( 1049 )
find_person_list.append( 752 )
for find_person_id in find_person_list:

    if ( find_person_id in master_person_dict ):
    
        print( "ERROR - single-name person {} is in dictionary".format( find_person_id ) )
    
    else:
    
        print( "SUCCESS - single-name person {} not in dictionary".format( find_person_id ) )
    
    #-- END check for person --#

#-- END loop over persons to find. --#

# output all persons.
for person_id, person_instance in master_person_dict.items():
    
    print( "\n==> Person {person_id}: {person_instance}".format( person_id = person_id, person_instance = person_instance ) )
    
#-- END loop over persons --#
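
The request specs in this notebook differ mainly in three values: `network_label`, `exclude_persons_with_tags_in_list`, and `include_persons_with_single_word_name` (the all-names spec also passes `include_source_contact_types` as a JSON list rather than a comma-delimited string). A minimal sketch of building each spec from one base dictionary; `BASE_REQUEST` and `make_request` are hypothetical names, and the values are copied from the specs in this notebook.

```python
import copy

# base spec: values copied from the request JSON used in this notebook.
BASE_REQUEST = {
    "coders": "7",
    "end_date": "2010-02-13",
    "tags_list": "",
    "date_range": "",
    "start_date": "2009-12-07",
    "output_type": "tab_delimited_matrix",
    "publications": "1",
    "network_label": "all_names",
    "person_coders": "7",
    "database_output": "yes",
    "person_end_date": "2010-02-13",
    "person_tag_list": "",
    "coder_types_list": "OpenCalais_REST_API_v2",
    "person_date_range": "",
    "person_query_type": "custom",
    "person_start_date": "2009-12-07",
    "unique_identifiers": "",
    "person_publications": "1",
    "coder_id_priority_list": "",
    "coder_type_filter_type": "automated",
    "network_include_headers": "no",
    "person_coder_types_list": "OpenCalais_REST_API_v2",
    "allow_duplicate_articles": "no",
    "network_data_output_type": "net_and_attr_cols",
    "network_download_as_file": "no",
    "person_unique_identifiers": "",
    "include_source_contact_types": "direct,event,past_quotes,document,other",
    "person_coder_id_priority_list": "",
    "person_coder_type_filter_type": "automated",
    "network_include_render_details": "no",
    "person_allow_duplicate_articles": "no",
    "exclude_persons_with_tags_in_list": "",
    "include_persons_with_single_word_name": "yes"
}


def make_request( network_label_IN, exclude_tags_IN = "", include_single_names_IN = "yes" ):

    # hypothetical helper: copy the base spec, then override the three
    #     values that vary between specs.
    request = copy.deepcopy( BASE_REQUEST )
    request[ "network_label" ] = network_label_IN
    request[ "exclude_persons_with_tags_in_list" ] = exclude_tags_IN
    request[ "include_persons_with_single_word_name" ] = include_single_names_IN

    return request

#-- END function make_request() --#
```

For example, `make_request( "no_single_names", include_single_names_IN = "no" )` reproduces the spec used in this section.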

## network data from "export" unit test data - exclude tag `from_press_release`

- See [`context_text` github README](https://github.com/jonathanmorgan/context_text#test-data) for more details on loading this data.

Tag `from_press_release` added to the following `Article_Subject` instances:

- 740 - granholm (person 102)
- 637 - Mark Meadows (person 224)
- 677 - Gary Nelund (person 261)


In [None]:
#include_single_word_names = "no"
include_single_word_names = "yes"

request_json_string = """{
  "coders": "7",
  "end_date": "2010-02-13",
  "tags_list": "",
  "date_range": "",
  "start_date": "2009-12-07",
  "output_type": "tab_delimited_matrix",
  "publications": "1",
  "network_label": "exclude_from_press_release",
  "person_coders": "7",
  "database_output": "yes",
  "person_end_date": "2010-02-13",
  "person_tag_list": "",
  "coder_types_list": "OpenCalais_REST_API_v2",
  "person_date_range": "",
  "person_query_type": "custom",
  "person_start_date": "2009-12-07",
  "unique_identifiers": "",
  "person_publications": "1",
  "coder_id_priority_list": "",
  "coder_type_filter_type": "automated",
  "network_include_headers": "no",
  "person_coder_types_list": "OpenCalais_REST_API_v2",
  "allow_duplicate_articles": "no",
  "network_data_output_type": "net_and_attr_cols",
  "network_download_as_file": "no",
  "person_unique_identifiers": "",
  "include_source_contact_types": "direct,event,past_quotes,document,other",
  "person_coder_id_priority_list": "",
  "person_coder_type_filter_type": "automated",
  "network_include_render_details": "no",
  "person_allow_duplicate_articles": "no",
  "exclude_persons_with_tags_in_list": "from_press_release",
  "include_persons_with_single_word_name": "yes"
}"""
request_json = json.loads( request_json_string )
print( request_json ) 

In [None]:
# Create network data with render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_YES
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# Create network data without render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_NO
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "3529e49830a8464cc0d8a497345b56404c73b867b1046fb38df346953a9b3b72"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 13122
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 71
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# persons 102, 224, 261 should not be present.
find_person_list = list()
find_person_list.append( 102 )
find_person_list.append( 224 )
find_person_list.append( 261 )
for find_person_id in find_person_list:

    if ( find_person_id in master_person_dict ):
    
        print( "ERROR - tagged person {} is in dictionary".format( find_person_id ) )
    
    else:
    
        print( "SUCCESS - tagged person {} not in dictionary".format( find_person_id ) )
    
    #-- END check for person --#

#-- END loop over persons to find. --#

# output all persons.
for person_id, person_instance in master_person_dict.items():
    
    print( "\n==> Person {person_id}: {person_instance}".format( person_id = person_id, person_instance = person_instance ) )
    
#-- END loop over persons --#

## network data from "export" unit test data - exclude tag `godwin_heights`

- See [`context_text` github README](https://github.com/jonathanmorgan/context_text#test-data) for more details on loading this data.

Tag `godwin_heights` added to the following `Article_Subject` instances:

- 623 - Felske, Jon (person 188)
- 622 - Johnston, Allen E. (person 187)
- 621 - Hornecker, Kenneth (person 189)


In [None]:
#include_single_word_names = "no"
include_single_word_names = "yes"

request_json_string = """{
  "coders": "7",
  "end_date": "2010-02-13",
  "tags_list": "",
  "date_range": "",
  "start_date": "2009-12-07",
  "output_type": "tab_delimited_matrix",
  "publications": "1",
  "network_label": "exclude_godwin_heights",
  "person_coders": "7",
  "database_output": "yes",
  "person_end_date": "2010-02-13",
  "person_tag_list": "",
  "coder_types_list": "OpenCalais_REST_API_v2",
  "person_date_range": "",
  "person_query_type": "custom",
  "person_start_date": "2009-12-07",
  "unique_identifiers": "",
  "person_publications": "1",
  "coder_id_priority_list": "",
  "coder_type_filter_type": "automated",
  "network_include_headers": "no",
  "person_coder_types_list": "OpenCalais_REST_API_v2",
  "allow_duplicate_articles": "no",
  "network_data_output_type": "net_and_attr_cols",
  "network_download_as_file": "no",
  "person_unique_identifiers": "",
  "include_source_contact_types": "direct,event,past_quotes,document,other",
  "person_coder_id_priority_list": "",
  "person_coder_type_filter_type": "automated",
  "network_include_render_details": "no",
  "person_allow_duplicate_articles": "no",
  "exclude_persons_with_tags_in_list": "godwin_heights",
  "include_persons_with_single_word_name": "yes"
}"""
request_json = json.loads( request_json_string )
print( request_json ) 

In [None]:
# Create network data with render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_YES
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# Create network data without render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_NO
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "59e1b6ba6aab28cf37fcb45877d8cdd86d8593df9fa506352d0abd1b6fd3c29b"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 13122
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 71
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# persons 187, 188, 189 should not be present.
find_person_list = list()
find_person_list.append( 187 )
find_person_list.append( 188 )
find_person_list.append( 189 )
for find_person_id in find_person_list:

    if ( find_person_id in master_person_dict ):
    
        print( "ERROR - tagged person {} is in dictionary".format( find_person_id ) )
    
    else:
    
        print( "SUCCESS - tagged person {} not in dictionary".format( find_person_id ) )
    
    #-- END check for person --#

#-- END loop over persons to find. --#

# output all persons.
for person_id, person_instance in master_person_dict.items():
    
    print( "\n==> Person {person_id}: {person_instance}".format( person_id = person_id, person_instance = person_instance ) )
    
#-- END loop over persons --#

## network data from "export" unit test data - exclude tags `from_press_release` and `godwin_heights`

- See [`context_text` github README](https://github.com/jonathanmorgan/context_text#test-data) for more details on loading this data.

Tag `from_press_release` added to the following `Article_Subject` instances:

- 740 - granholm (person 102)
- 637 - Mark Meadows (person 224)
- 677 - Gary Nelund (person 261)

Tag `godwin_heights` added to the following `Article_Subject` instances:

- 623 - Felske, Jon (person 188)
- 622 - Johnston, Allen E. (person 187)
- 621 - Hornecker, Kenneth (person 189)


In [None]:
#include_single_word_names = "no"
include_single_word_names = "yes"

request_json_string = """{
  "coders": "7",
  "end_date": "2010-02-13",
  "tags_list": "",
  "date_range": "",
  "start_date": "2009-12-07",
  "output_type": "tab_delimited_matrix",
  "publications": "1",
  "network_label": "exclude_two_tags",
  "person_coders": "7",
  "database_output": "yes",
  "person_end_date": "2010-02-13",
  "person_tag_list": "",
  "coder_types_list": "OpenCalais_REST_API_v2",
  "person_date_range": "",
  "person_query_type": "custom",
  "person_start_date": "2009-12-07",
  "unique_identifiers": "",
  "person_publications": "1",
  "coder_id_priority_list": "",
  "coder_type_filter_type": "automated",
  "network_include_headers": "no",
  "person_coder_types_list": "OpenCalais_REST_API_v2",
  "allow_duplicate_articles": "no",
  "network_data_output_type": "net_and_attr_cols",
  "network_download_as_file": "no",
  "person_unique_identifiers": "",
  "include_source_contact_types": "direct,event,past_quotes,document,other",
  "person_coder_id_priority_list": "",
  "person_coder_type_filter_type": "automated",
  "network_include_render_details": "no",
  "person_allow_duplicate_articles": "no",
  "exclude_persons_with_tags_in_list": "from_press_release,godwin_heights",
  "include_persons_with_single_word_name": "yes"
}"""
request_json = json.loads( request_json_string )
print( request_json ) 

In [None]:
# Create network data with render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_YES
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# Create network data without render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_NO
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "441127876c15eda7fb6cbf64e8555e011a2f459ba64b7111ac3dd4cbcdafbb2a"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 12159
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 68
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# persons 102, 224, 261, 187, 188, 189 should not be present.
find_person_list = list()
find_person_list.append( 102 )
find_person_list.append( 224 )
find_person_list.append( 261 )
find_person_list.append( 187 )
find_person_list.append( 188 )
find_person_list.append( 189 )
for find_person_id in find_person_list:

    if ( find_person_id in master_person_dict ):
    
        print( "ERROR - tagged person {} is in dictionary".format( find_person_id ) )
    
    else:
    
        print( "SUCCESS - tagged person {} not in dictionary".format( find_person_id ) )
    
    #-- END check for person --#

#-- END loop over persons to find. --#

# output all persons.
for person_id, person_instance in master_person_dict.items():
    
    print( "\n==> Person {person_id}: {person_instance}".format( person_id = person_id, person_instance = person_instance ) )
    
#-- END loop over persons --#
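
The expected values checked so far can be gathered into one table keyed by `network_label`, which is the shape a unit test would likely want. A minimal sketch: `EXPECTED_OUTPUT` and `compare_to_expected` are hypothetical names, while the hashes, lengths, and person counts are the `should_be` values from the sections above (hash and length are for output without render details).

```python
# network_label --> ( expected hash, expected length, expected person count )
EXPECTED_OUTPUT = {
    "all_names": (
        "f879560cab27653185bb4e42baec40b6a5d685b4143388e55041399acb921c5f", 14121, 74 ),
    "no_single_names": (
        "f85a48630c029f848bbb815d003b188eff38346b8eac0da2d55b7b224b323ac5", 13448, 72 ),
    "exclude_from_press_release": (
        "3529e49830a8464cc0d8a497345b56404c73b867b1046fb38df346953a9b3b72", 13122, 71 ),
    "exclude_godwin_heights": (
        "59e1b6ba6aab28cf37fcb45877d8cdd86d8593df9fa506352d0abd1b6fd3c29b", 13122, 71 ),
    "exclude_two_tags": (
        "441127876c15eda7fb6cbf64e8555e011a2f459ba64b7111ac3dd4cbcdafbb2a", 12159, 68 ),
}


def compare_to_expected( label_IN, hash_IN, length_IN, person_count_IN ):

    # hypothetical helper: return a list of mismatch messages for one
    #     spec; an empty list means everything matched.
    expected_hash, expected_length, expected_count = EXPECTED_OUTPUT[ label_IN ]
    mismatch_list = []

    if hash_IN != expected_hash:
        mismatch_list.append( "hash is {}, should be {}".format( hash_IN, expected_hash ) )
    #-- END hash check --#

    if length_IN != expected_length:
        mismatch_list.append( "length is {}, should be {}".format( length_IN, expected_length ) )
    #-- END length check --#

    if person_count_IN != expected_count:
        mismatch_list.append( "person count is {}, should be {}".format( person_count_IN, expected_count ) )
    #-- END person count check --#

    return mismatch_list

#-- END function compare_to_expected() --#
```

After a run, this could be called with `network_data_hash`, `network_data_length`, and `person_count` from the cells above.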

## network data from "export" unit test data - no single names, exclude tags `from_press_release` and `godwin_heights`

- See [`context_text` github README](https://github.com/jonathanmorgan/context_text#test-data) for more details on loading this data.

Tag `from_press_release` added to the following `Article_Subject` instances:

- 740 - granholm (person 102)
- 637 - Mark Meadows (person 224)
- 677 - Gary Nelund (person 261)

Tag `godwin_heights` added to the following `Article_Subject` instances:

- 623 - Felske, Jon (person 188)
- 622 - Johnston, Allen E. (person 187)
- 621 - Hornecker, Kenneth (person 189)


In [None]:
#include_single_word_names = "yes"
include_single_word_names = "no"

request_json_string = """{
  "coders": "7",
  "end_date": "2010-02-13",
  "tags_list": "",
  "date_range": "",
  "start_date": "2009-12-07",
  "output_type": "tab_delimited_matrix",
  "publications": "1",
  "network_label": "exclude_two_tags_and_single_names",
  "person_coders": "7",
  "database_output": "yes",
  "person_end_date": "2010-02-13",
  "person_tag_list": "",
  "coder_types_list": "OpenCalais_REST_API_v2",
  "person_date_range": "",
  "person_query_type": "custom",
  "person_start_date": "2009-12-07",
  "unique_identifiers": "",
  "person_publications": "1",
  "coder_id_priority_list": "",
  "coder_type_filter_type": "automated",
  "network_include_headers": "no",
  "person_coder_types_list": "OpenCalais_REST_API_v2",
  "allow_duplicate_articles": "no",
  "network_data_output_type": "net_and_attr_cols",
  "network_download_as_file": "no",
  "person_unique_identifiers": "",
  "include_source_contact_types": "direct,event,past_quotes,document,other",
  "person_coder_id_priority_list": "",
  "person_coder_type_filter_type": "automated",
  "network_include_render_details": "no",
  "person_allow_duplicate_articles": "no",
  "exclude_persons_with_tags_in_list": "from_press_release,godwin_heights",
  "include_persons_with_single_word_name": "no"
}"""
request_json = json.loads( request_json_string )
print( request_json ) 

In [None]:
# Create network data with render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_YES
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# Create network data without render details.
request_json[ ContextTextBase.PARAM_NETWORK_INCLUDE_RENDER_DETAILS ] = ContextTextBase.CHOICE_NO
network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = request_json,
    debug_flag_IN = False
)

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "0f8a530f18a724b3d724d7fe9caa3082954c049abdc02b77bc480fc432d0a770"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#
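
The `make_string_hash` helper is not defined in this notebook. As a rough sketch, assuming it returns a SHA-256 hex digest of the UTF-8 encoded string (the 64-hex-character expected value above is consistent with that, but the real helper may differ), an equivalent could look like the following. The name `sketch_string_hash` is hypothetical.

```python
import hashlib


def sketch_string_hash( value_IN, encoding_IN = "utf-8" ):

    # hypothetical stand-in for make_string_hash(); assumes SHA-256 over the
    #     encoded string, returned as a hex digest.
    hash_instance = hashlib.sha256()
    hash_instance.update( value_IN.encode( encoding_IN ) )
    return hash_instance.hexdigest()

#-- END function sketch_string_hash() --#


# a SHA-256 hex digest is always 64 characters long, whatever the input.
example_hash = sketch_string_hash( "example network data" )
print( "hash length: {}".format( len( example_hash ) ) )
```

Because the digest depends on every character of the output, a hash match is a stricter check than the length comparison that follows it.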

In [None]:
network_data_length = len( network_data )
should_be = 11534
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 66
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# the following persons should not be present
find_person_list = list()

# 1049, 752 (single names)
find_person_list.append( 1049 )
find_person_list.append( 752 )

# 102, 224, 261 (tag `from_press_release`)
find_person_list.append( 102 )
find_person_list.append( 224 )
find_person_list.append( 261 )

# 187, 188, 189 (tag `godwin_heights`)
find_person_list.append( 187 )
find_person_list.append( 188 )
find_person_list.append( 189 )

# check for people who should have been removed.
for find_person_id in find_person_list:

    if ( find_person_id in master_person_dict ):
    
        print( "ERROR - excluded person {} is in dictionary".format( find_person_id ) )
    
    else:
    
        print( "SUCCESS - excluded person {} not in dictionary".format( find_person_id ) )
    
    #-- END check for person --#

#-- END loop over persons to find. --#

# output all persons.
for person_id, person_instance in master_person_dict.items():
    
    print( "\n==> Person {person_id}: {person_instance}".format( person_id = person_id, person_instance = person_instance ) )
    
#-- END loop over persons --#
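
The check above covers persons excluded for different reasons, so it can help to report the reason alongside each ID. A minimal sketch, assuming only that the person dictionary is keyed by integer person ID. The helper name `find_unexpected_persons` and the stand-in dictionary are hypothetical; the IDs are the ones listed in this section.

```python
def find_unexpected_persons( person_dict_IN, excluded_by_reason_IN ):

    # return a list of ( reason, person_id ) tuples for persons that should
    #     have been excluded but are still keys in the dictionary.
    unexpected_list = []
    for reason, person_id_list in excluded_by_reason_IN.items():

        for person_id in person_id_list:

            if ( person_id in person_dict_IN ):

                unexpected_list.append( ( reason, person_id ) )

            #-- END check for person --#

        #-- END loop over person IDs --#

    #-- END loop over reasons --#

    return unexpected_list

#-- END function find_unexpected_persons() --#


excluded_by_reason = {
    "single name": [ 1049, 752 ],
    "tag from_press_release": [ 102, 224, 261 ],
    "tag godwin_heights": [ 187, 188, 189 ],
}

# stand-in for master_person_dict; person 224 is left in on purpose to show
#     what a failure looks like.
example_person_dict = { 5: "person 5", 224: "person 224" }
unexpected_list = find_unexpected_persons( example_person_dict, excluded_by_reason )
for reason, person_id in unexpected_list:

    print( "ERROR - person {} ( {} ) is in dictionary".format( person_id, reason ) )

#-- END loop over unexpected persons --#
```

An empty list returned for the real `master_person_dict` would mean every exclusion was applied.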

# Export coded data to new fixtures

- Back to [Table of Contents](#Table-of-Contents)

Once you have coded articles, you'll want to re-export the coded data to fixtures.

Export them to JSON fixture files with `python manage.py dumpdata` or `django-admin dumpdata` ( [https://docs.djangoproject.com/en/dev/ref/django-admin/#django-admin-dumpdata](https://docs.djangoproject.com/en/dev/ref/django-admin/#django-admin-dumpdata) ) so they can be loaded with `python manage.py loaddata` or `django-admin loaddata` ( [https://docs.djangoproject.com/en/dev/ref/django-admin/#django-admin-loaddata](https://docs.djangoproject.com/en/dev/ref/django-admin/#django-admin-loaddata) ) rather than re-entered by hand in the admin:

    python manage.py dumpdata [app_label[.ModelName] [app_label[.ModelName] ...]] --indent INDENT --output <output_file_path>

Example:

    python manage.py dumpdata \
        --indent 4 \
        --output context-sourcenet_entities_and_relations.json \
        context.Entity_Identifier_Type \
        context.Entity_Relation_Type \
        context.Entity_Relation_Type_Trait \
        context.Entity_Type \
        context.Entity_Type_Trait \
        context.Trait_Type \
        context.Term \
        context.Term_Relation \
        context.Term_Relation_Type \
        context.Vocabulary

No line breaks:

    python manage.py dumpdata --indent 4 --output context-sourcenet_entities_and_relations.json context.Entity_Identifier_Type context.Entity_Relation_Type context.Entity_Relation_Type_Trait context.Entity_Type context.Entity_Type_Trait context.Trait_Type context.Term context.Term_Relation context.Term_Relation_Type context.Vocabulary

The changes we've made here are in two applications: `context_text` and `taggit`.  To make a new fixture for each:

- `python manage.py dumpdata --indent 4 --output context_text_unittest_export_data.json context_text`
- `python manage.py dumpdata --indent 4 --output context_text_unittest_export_taggit_data.json taggit`

_Note: I included the `article_data_notes` (raw OpenCalais XML) last time and am keeping them this time, too. To leave them out:_

- `python manage.py dumpdata --indent 4 --output context_text_unittest_export_data.json --exclude context_text.article_data_notes context_text`

These are stored in the `context_text` github repo, in `context_text/fixtures`.
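
If the exports are re-run often, the commands above can be assembled in a script. A minimal sketch that only builds the argument lists and does not invoke Django itself. The helper name `build_dumpdata_args` is hypothetical, while `--indent`, `--output`, and `--exclude` are the `dumpdata` options already used above.

```python
def build_dumpdata_args( output_path_IN, app_label_list_IN, indent_IN = 4, exclude_list_IN = None ):

    # build the argument list for one "python manage.py dumpdata" call.
    args = [ "python", "manage.py", "dumpdata" ]
    args.extend( [ "--indent", str( indent_IN ) ] )
    args.extend( [ "--output", output_path_IN ] )

    # optional exclusions, one "--exclude" flag per label.
    for exclude_label in ( exclude_list_IN or [] ):

        args.extend( [ "--exclude", exclude_label ] )

    #-- END loop over exclusions --#

    # app labels (or app_label.ModelName values) go last.
    args.extend( app_label_list_IN )
    return args

#-- END function build_dumpdata_args() --#


# the two fixtures described above.
fixture_spec_list = [
    ( "context_text_unittest_export_data.json", [ "context_text" ] ),
    ( "context_text_unittest_export_taggit_data.json", [ "taggit" ] ),
]
command_list = [ build_dumpdata_args( path, labels ) for path, labels in fixture_spec_list ]
for command in command_list:

    print( " ".join( command ) )

#-- END loop over commands --#
```

Each list could then be passed to `subprocess.run( command, check = True )` from the directory that contains `manage.py`.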