**data creation - filter locally implemented hard news**

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Setup---Imports" data-toc-modified-id="Setup---Imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Setup - Imports</a></span></li><li><span><a href="#Setup---virtualenv-jupyter-kernel" data-toc-modified-id="Setup---virtualenv-jupyter-kernel-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Setup - virtualenv jupyter kernel</a></span></li><li><span><a href="#Setup---Initialize-Django" data-toc-modified-id="Setup---Initialize-Django-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Setup - Initialize Django</a></span></li></ul></li><li><span><a href="#Analysis-steps" data-toc-modified-id="Analysis-steps-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Analysis steps</a></span></li><li><span><a href="#Data-Creation" data-toc-modified-id="Data-Creation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Creation</a></span><ul class="toc-item"><li><span><a href="#Data-Creation---local-GRP-articles" data-toc-modified-id="Data-Creation---local-GRP-articles-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Data Creation - local GRP articles</a></span><ul class="toc-item"><li><span><a href="#Data-Creation---GRP---examples" data-toc-modified-id="Data-Creation---GRP---examples-3.1.1"><span class="toc-item-num">3.1.1&nbsp;&nbsp;</span>Data Creation - GRP - examples</a></span></li><li><span><a href="#Data-Creation---GRP---prelim-month" data-toc-modified-id="Data-Creation---GRP---prelim-month-3.1.2"><span class="toc-item-num">3.1.2&nbsp;&nbsp;</span>Data Creation - GRP - prelim month</a></span></li></ul></li><li><span><a href="#Data-Creation---local-TDN-articles" data-toc-modified-id="Data-Creation---local-TDN-articles-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Data Creation - local TDN articles</a></span><ul class="toc-item"><li><span><a 
href="#Data-Creation---TDN---local-bylines" data-toc-modified-id="Data-Creation---TDN---local-bylines-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Data Creation - TDN - local bylines</a></span></li></ul></li></ul></li></ul></div>

# Setup

- Back to [Table of Contents](#Table-of-Contents)

## Setup - Imports

- Back to [Table of Contents](#Table-of-Contents)

In [12]:
import datetime
import six

# Django query object for OR-ing selection criteria together.
from django.db.models import Q

# imports - python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

print( "packages imported at " + str( datetime.datetime.now() ) )

packages imported at 2020-02-28 20:53:46.336181


## Setup - virtualenv jupyter kernel

- Back to [Table of Contents](#Table-of-Contents)

If you are using a virtualenv, make sure that you:

- have installed your virtualenv as a kernel.
- choose the kernel for your virtualenv as the kernel for your notebook (Kernel --> Change kernel).

Since I use a virtualenv, I need to activate it somehow inside this notebook.  One option is to run `../dev/wsgi.py` in this notebook, which configures the Python environment manually as if you had activated the `sourcenet` virtualenv.  To do this, make a code cell that contains:

    %run ../dev/wsgi.py
    
This is sketchy, however, because it changes the Python environment of whatever kernel is currently running, and I'd worry about collisions with the actual Python 3 kernel.  A better option is to install the virtualenv as a separate kernel.  Steps:

- activate your virtualenv:

        workon sourcenet

- in your virtualenv, install the package `ipykernel`.

        pip install ipykernel

- use the ipykernel python program to install the current environment as a kernel:

        python -m ipykernel install --user --name <env_name> --display-name "<display_name>"
        
    `sourcenet` example:
    
        python -m ipykernel install --user --name sourcenet --display-name "sourcenet (Python 3)"
        
More details: [http://ipython.readthedocs.io/en/stable/install/kernel_install.html](http://ipython.readthedocs.io/en/stable/install/kernel_install.html)
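
Once the kernel is installed and selected, it can be worth confirming that the notebook really is running inside the virtualenv. The cell below is a minimal sketch using only the standard library; the helper name `in_virtualenv` is made up for this example and is not part of `ipykernel` or this project.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment, sys.prefix points at the environment,
    #     while sys.base_prefix points at the interpreter it was built from.
    #     (Very old virtualenv releases set sys.real_prefix instead.)
    base_prefix = getattr( sys, "base_prefix", sys.prefix )
    return hasattr( sys, "real_prefix" ) or ( sys.prefix != base_prefix )


print( "interpreter: " + sys.executable )
print( "prefix: " + sys.prefix )
print( "in virtualenv: " + str( in_virtualenv() ) )
```

If the interpreter path does not sit under the `sourcenet` environment, the notebook is probably still attached to the default kernel.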

In [2]:
%pwd

'/home/jonathanmorgan/work/django/research/work/phd_work/methods/data_creation'

## Setup - Initialize Django

- Back to [Table of Contents](#Table-of-Contents)

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.

In [4]:
# init django
django_init_folder = "/home/jonathanmorgan/work/django/research/work/phd_work"
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [5]:
%run $django_init_path

django initialized at 2020-02-28 20:30:34.476100


In [13]:
# imports - context_text
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Text
from context_text.models import Newspaper


# Analysis steps

The code in the methods folder was run in the following order for the analysis:

- 1) `data_creation` (results in `data` folder)
- 2) `reliability` - for human reliability.
- 3) `evaluate_disagreements` - correct the human coding where a coder was wrong in a disagreement, to create "ground truth".
- 4) `precision_recall`
- 5) `reliability` - for comparing human and computer coding to baseline.
- 6) `network_analysis`
- 7) `results`

# Data Creation

- back to [Table of Contents](#Table-of-Contents)

Below are the criteria used for each paper to filter down to just locally-implemented hard news articles.

## Data Creation - local GRP articles

- back to [Table of Contents](#Table-of-Contents)

Definition of local hard news and in-house implementor:

- Grand Rapids Press

    - `context_text/examples/articles/articles-GRP-local_news.py`
    - local hard news sections (stored in `Article.GRP_NEWS_SECTION_NAME_LIST`):
   
        - "Business"
        - "City and Region"
        - "Front Page"
        - "Lakeshore"
        - "Religion"
        - "Special"
        - "State"

    - excluding any publications with index term of "Column".
    - in-house implementor (based on byline patterns, stored in `context_text.models.Article.Q_GRP_IN_HOUSE_AUTHOR`):
    
        - Byline ends in "/ THE GRAND RAPIDS PRESS", ignore case.
        
            - `Q( author_varchar__iregex = r'.* */ *THE GRAND RAPIDS PRESS$' )`

        - Byline ends in "/ PRESS * EDITOR", ignore case.
     
            - `Q( author_varchar__iregex = r'.* */ *PRESS .* EDITOR$' )`
        
        - Byline ends in "/ GRAND RAPIDS PRESS * BUREAU", ignore case.
        
            - `Q( author_varchar__iregex = r'.* */ *GRAND RAPIDS PRESS .* BUREAU$' )`
            
        - Byline ends in "/ SPECIAL TO THE PRESS", ignore case.
        
            - `Q( author_varchar__iregex = r'.* */ *SPECIAL TO THE PRESS$' )`
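
To see what the four byline patterns above accept, they can be tried against a few strings with Python's `re` module. This is only an approximation: Django's `iregex` lookup is evaluated by the database backend, whose regular expression dialect may differ from Python's. The helper `is_grp_in_house` and the sample bylines are invented for illustration and are not taken from the data.

```python
import re

# The four GRP byline patterns listed above, copied verbatim.
GRP_IN_HOUSE_PATTERNS = [
    r'.* */ *THE GRAND RAPIDS PRESS$',
    r'.* */ *PRESS .* EDITOR$',
    r'.* */ *GRAND RAPIDS PRESS .* BUREAU$',
    r'.* */ *SPECIAL TO THE PRESS$',
]


def is_grp_in_house( byline ):
    """Return True if the byline matches any pattern, ignoring case."""
    for pattern in GRP_IN_HOUSE_PATTERNS:
        if re.search( pattern, byline, re.IGNORECASE ) is not None:
            return True
    return False


# hypothetical bylines, invented for this example.
sample_bylines = [
    "By Jane Doe / The Grand Rapids Press",
    "By Jane Doe / Press Religion Editor",
    "By Jane Doe / Grand Rapids Press Lansing Bureau",
    "By Jane Doe / Special to The Press",
    "By Jane Doe / The Associated Press",
]
for sample in sample_bylines:
    print( "{} --> {}".format( sample, is_grp_in_house( sample ) ) )
```

The first four samples match one pattern each; the wire-service byline matches none, which is the behavior the in-house filter relies on.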

In [15]:
#==============================================================================#
# ! declare variables
#==============================================================================#

# declare variables - Grand Rapids Press
local_news_sections = []
selected_newspaper = None
article_qs = None
date_range_start = None
date_range_end = None
custom_article_q = None
article_count = None
tags_in_list = []
tags_not_in_list = []
random_count = -1

# declare variables - size of random sample we want
random_count = 0

# declare variables - process articles?
do_process_articles = False
do_apply_tag = False
tag_to_apply = None
tags_to_apply_list = None
article_id_list = None
article_counter = -1
current_article = None

#==============================================================================#
# ! configure filters
#==============================================================================#

# ==> Article.filter_articles()
# date range, newspaper, section list, and custom Q().
date_range_start = None
date_range_end = None
selected_newspaper = None
section_name_in_list = None
custom_article_q = None

# ==> date range

# month of local news from Grand Rapids Press from 2009-12-01 to 2009-12-31
#date_range_start = "2009-12-01"
#date_range_end = "2009-12-31"

# ==> newspaper
selected_newspaper = Newspaper.objects.get( id = 1 ) # Grand Rapids Press

# ==> limit to "local, regional and state news" sections.

#local_news_sections.append( "Lakeshore" )
#local_news_sections.append( "Front Page" )
#local_news_sections.append( "City and Region" )
#local_news_sections.append( "Business" )
#local_news_sections.append( "Religion" )
#local_news_sections.append( "State" )
#local_news_sections.append( "Special" )
local_news_sections = Article.GRP_NEWS_SECTION_NAME_LIST
section_name_in_list = local_news_sections

# ==> limit to staff reporters.

# use custom Article Q to filter down to in-house authors.
custom_article_q = Article.Q_GRP_IN_HOUSE_AUTHOR

# ==> article IDs - include
# (must be defined, since the ID filter below checks it.)
article_id_in_list = None

# ==> tags - exclude
#tags_not_in_list = [ "prelim_reliability", "prelim_network" ]
tags_not_in_list = None

# ==> tags - include only those with certain tags.
#tags_in_list = [ "prelim_reliability", "prelim_network" ]
#tags_in_list = [ "prelim_reliability", ]
tags_in_list = None

# ==> filter out "*prelim*" tags?
filter_out_prelim_tags = False

# ==> ORDER BY - do we want a random sample?
#random_count = 10
random_count = -1

#==============================================================================#
# ! configure processing
#==============================================================================#

do_process_articles = False
do_apply_tag = False
tags_to_apply_list = []
#tags_to_apply_list = [ "locally_implemented_hard_news" ]

print( "Filter params:" )
print( "- date_range_start: {}".format( date_range_start ) )
print( "- date_range_end: {}".format( date_range_end ) )
print( "- newspaper: {}".format( selected_newspaper ) )
print( "- local_news_sections: {}".format( local_news_sections ) )
print( "- custom_article_q: {}".format( custom_article_q ) )
print( "" )

#==============================================================================#
# ! Do it!
#==============================================================================#

# start with all articles
article_qs = Article.objects.all()

# filter to include date range, newspaper, section list, and in-house reporters.
article_qs = Article.filter_articles( qs_IN = article_qs,
                                      start_date = date_range_start,
                                      end_date = date_range_end,
                                      newspaper = selected_newspaper,
                                      section_name_list = section_name_in_list,
                                      custom_article_q = custom_article_q )
# no columns
article_qs = article_qs.exclude( index_terms__icontains = "Column" )

# how many is that?
article_count = article_qs.count()
print( "Article count before filtering on Article IDs: " + str( article_count ) )

# ! ==> article IDs in list
if ( ( article_id_in_list is not None ) and ( len( article_id_in_list ) > 0 ) ):

    # include those in a list
    print( "filtering articles to those with IDs: " + str( article_id_in_list ) )
    article_qs = article_qs.filter( id__in = article_id_in_list )

#-- END check to see if we have a specific list of article IDs we want to include --#

# how many is that?
article_count = article_qs.count()
print( "Article count before filtering on tags: " + str( article_count ) )

# ==> tags

# tags to exclude
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "exclude-ing articles with tags: " + str( tags_not_in_list ) )
    article_qs = article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# ! ==> tags - include only those with certain tags.
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    article_qs = article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# how many is that?
article_count = article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

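# do we want a random sample? (uses random_count as configured above, rather
#     than resetting it here, so the configured sample size takes effect.)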
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    article_qs = article_qs.order_by( "?" )[ : random_count ]
    
#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

# process articles?
if ( do_process_articles == True ):

    # make list of article IDs.

    article_id_list = []
    article_counter = 0
    for current_article in article_qs:

        # increment article_counter
        article_counter += 1

        # add IDs to article_id_list
        article_id_list.append( str( current_article.id ) )

        # ! ==> apply tags
        
        # apply tag(s) while we are at it?
        if ( ( do_apply_tag == True ) and ( tags_to_apply_list is not None ) and ( len( tags_to_apply_list ) > 0 ) ):
        
            print( "- Applying tags " + str( tags_to_apply_list ) + " to article.")

            # yes, please.  Loop over tags list.
            for tag_to_apply in tags_to_apply_list:
            
                # tag the article with each tag in the list.
                current_article.tags.add( tag_to_apply )
                
                print( "====> Applied tag \"" + tag_to_apply + "\"." )
                
            #-- END loop over tag list. --#
            
            print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
            
        #-- END check to see if we apply tag. --#

        # output the tags.
        print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )

    #-- END loop over articles --#

    # output the list.
    print( "Found " + str( article_counter ) + " articles ( " + str( article_count ) + " )." )
    print( "List of " + str( len( article_id_list ) ) + " local GRP staff article IDs: " + ", ".join( article_id_list ) )
    
#-- END check to see if we process articles --#

Filter params:
- date_range_start: None
- date_range_end: None
- newspaper: 1 - Grand Rapids Press, The ( GRPB )
- local_news_sections: ['Business', 'City and Region', 'Front Page', 'Lakeshore', 'Religion', 'Special', 'State']
- custom_article_q: (OR: ('author_varchar__iregex', '.* */ *THE GRAND RAPIDS PRESS$'), ('author_varchar__iregex', '.* */ *PRESS .* EDITOR$'), ('author_varchar__iregex', '.* */ *GRAND RAPIDS PRESS .* BUREAU$'), ('author_varchar__iregex', '.* */ *SPECIAL TO THE PRESS$'))

Article count before filtering on Article IDs: 41107
Article count before filtering on tags: 41107
Article count after tag filtering: 41107
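
The cell above can draw a random sample with `order_by( "?" )`, which asks the database to sort every matching row randomly before slicing. The article linked in the code comments argues against that approach on large tables. One alternative, sketched below with plain Python, is to fetch only the matching IDs and sample them in memory. The helper `sample_article_ids` is hypothetical, and the list of integers stands in for something like `list( article_qs.values_list( "id", flat = True ) )`.

```python
import random


def sample_article_ids( id_list, sample_size, seed = None ):
    """Return up to sample_size IDs drawn at random without replacement."""
    # a dedicated Random instance keeps the draw reproducible when seeded.
    rng = random.Random( seed )
    id_list = list( id_list )
    # never ask for more items than exist, and treat <= 0 as "none".
    sample_size = max( 0, min( sample_size, len( id_list ) ) )
    return sorted( rng.sample( id_list, sample_size ) )


# stand-in for the IDs of the filtered articles.
all_ids = list( range( 1, 101 ) )
sampled_ids = sample_article_ids( all_ids, 10, seed = 42 )
print( "sampled {} of {} IDs".format( len( sampled_ids ), len( all_ids ) ) )
```

The sampled IDs could then be passed back to the existing ID filter through `article_id_in_list`, and a fixed seed makes the sample repeatable across runs.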


### Data Creation - GRP - examples

just 2009-12-01 through 2009-12-31:

    Filter params:
    - date_range_start: 2009-12-01
    - date_range_end: 2009-12-31
    - newspaper: 1 - Grand Rapids Press, The ( GRPB )
    - local_news_sections: ['Business', 'City and Region', 'Front Page', 'Lakeshore', 'Religion', 'Special', 'State']
    - custom_article_q: (OR: ('author_varchar__iregex', '.* */ *THE GRAND RAPIDS PRESS$'), ('author_varchar__iregex', '.* */ *PRESS .* EDITOR$'), ('author_varchar__iregex', '.* */ *GRAND RAPIDS PRESS .* BUREAU$'), ('author_varchar__iregex', '.* */ *SPECIAL TO THE PRESS$'))

    Article count before filtering on tags: 441
    Article count after tag filtering: 441
    
All articles:

    Filter params:
    - date_range_start: None
    - date_range_end: None
    - newspaper: 1 - Grand Rapids Press, The ( GRPB )
    - local_news_sections: ['Business', 'City and Region', 'Front Page', 'Lakeshore', 'Religion', 'Special', 'State']
    - custom_article_q: (OR: ('author_varchar__iregex', '.* */ *THE GRAND RAPIDS PRESS$'), ('author_varchar__iregex', '.* */ *PRESS .* EDITOR$'), ('author_varchar__iregex', '.* */ *GRAND RAPIDS PRESS .* BUREAU$'), ('author_varchar__iregex', '.* */ *SPECIAL TO THE PRESS$'))

    Article count before filtering on tags: 41107
    Article count after tag filtering: 41107

### Data Creation - GRP - prelim month

Description of month of GRP articles from December, 2009, for paper.

- grp_month article count = 441


In [5]:
from context_text.models import Article

In [None]:
# how many articles in "grp_month"?
article_qs = Article.objects.filter( tags__name__in = [ "grp_month" ] )
grp_month_count = article_qs.count()

print( "grp_month count = {}".format( grp_month_count ) )

## Data Creation - local TDN articles

- back to [Table of Contents](#Table-of-Contents)

Definition of local hard news and in-house implementor:

- Detroit News

    - `context_text/examples/articles/articles-TDN-local_news.py`
    - local hard news sections (stored in `DTNB.NEWS_SECTION_NAME_LIST`, where `DTNB` is imported from `context_text.collectors.newsbank.newspapers.DTNB`):
   
        - "Business"
        - "Metro"
        - "Nation" - because of auto industry stories

    - in-house implementor (based on byline patterns, stored in `DTNB.Q_IN_HOUSE_AUTHOR`):
    
        - Byline ends in "/ The Detroit News", ignore case.
        
            - `Q( author_varchar__iregex = r'.*\s*/\s*the\s*detroit\s*news$' )`

        - Byline ends in "/ Special to The Detroit News", ignore case.
        
            - `Q( author_varchar__iregex = r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$' )`
        
        - Byline ends in "/ Detroit News * Bureau", ignore case.
        
            - `Q( author_varchar__iregex = r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$' )`    
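
Unlike the GRP patterns, the Detroit News patterns use `\s*` between words, so they tolerate missing or repeated whitespace. The sketch below joins the three patterns into one case-insensitive alternation with Python's `re` module to show this. As before, the database backend evaluates the real `iregex` lookups and may behave slightly differently; `is_tdn_in_house` and the sample bylines are invented for illustration.

```python
import re

# The three Detroit News byline patterns listed above, copied verbatim.
TDN_IN_HOUSE_PATTERNS = [
    r'.*\s*/\s*the\s*detroit\s*news$',
    r'.*\s*/\s*special\s*to\s*the\s*detroit\s*news$',
    r'.*\s*/\s*detroit\s*news\s*.*\s*bureau$',
]

# one compiled, case-insensitive alternation of the three patterns.
TDN_IN_HOUSE_REGEX = re.compile(
    "|".join( "(?:{})".format( pattern ) for pattern in TDN_IN_HOUSE_PATTERNS ),
    re.IGNORECASE )


def is_tdn_in_house( byline ):
    """Return True if the byline matches any of the three patterns."""
    return TDN_IN_HOUSE_REGEX.search( byline ) is not None


# hypothetical bylines, invented for this example.
sample_bylines = [
    "Jane Doe / The Detroit News",
    "Jane Doe/The  Detroit News",
    "Jane Doe / Special to The Detroit News",
    "Jane Doe / Detroit News Washington Bureau",
    "Jane Doe / Associated Press",
    "The Detroit News",
]
for sample in sample_bylines:
    print( "{} --> {}".format( sample, is_tdn_in_house( sample ) ) )
```

Note that every pattern requires a "/" before the affiliation, so a byline consisting only of "The Detroit News" is not counted as in-house.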

In [14]:
#==============================================================================#
# ! imports
#==============================================================================#

# Django query object for OR-ing selection criteria together.
from django.db.models import Q

# imports - python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

# imports - context_text
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Text
from context_text.models import Newspaper

#==============================================================================#
# ! declare variables
#==============================================================================#
selected_newspaper = None
article_qs = None
article_count = -1
article_counter = -1

# declare variables - filtering
start_date = ""
end_date = ""
local_news_sections = []
section_name_in_list = []
custom_article_q = None
affiliation = ""
article_id_list = []
article_id_in_list = []
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1
limit_count = -1
article_counter = -1

# declare variables - capturing author info.
do_capture_author_info = False
processing_counter = -1
processing_id_list = []
processing_section_list = []

# declare variables - size of random sample we want
#random_count = 60

# declare variables - also, apply tag?
do_apply_tag = False
tags_to_apply_list = []

# declare variables - details on author string anomalies
anomaly_detail = {}
author_anomaly_article_id = -1
author_anomaly_author_string = ""
author_anomaly_graf_1 = ""
author_anomaly_graf_2 = ""
anomaly_detail_string = ""

#==============================================================================#
# ! configure filters
#==============================================================================#

# ==> Article.filter_articles()
# date range, newspaper, section list, and custom Q().
start_date = None
end_date = None
selected_newspaper = None
section_name_in_list = None
custom_article_q = None

# month of local news from Detroit News from 2009-12-01 to 2009-12-31
#start_date = "2009-12-01"
#end_date = "2009-12-31"
selected_newspaper = Newspaper.objects.get( id = 2 ) # Detroit News

# limit to "local, regional and state news" sections.
section_name_in_list = DTNB.NEWS_SECTION_NAME_LIST

# limit to staff reporters.
custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR

# ==> article IDs - include
#article_id_in_list = None

# ==> tags - exclude
#tags_not_in_list = [ "prelim_reliability", "prelim_network" ]
tags_not_in_list = None

# ==> tags - include only those with certain tags.
#tags_in_list = [ "prelim_reliability", "prelim_network" ]
#tags_in_list = [ "prelim_reliability", ]
tags_in_list = None

# ==> filter out "*prelim*" tags?
filter_out_prelim_tags = False

# ==> ORDER BY - do we want a random sample?
#random_count = 10
random_count = -1

#==============================================================================#
# ! configure processing
#==============================================================================#

do_capture_author_info = False
do_apply_tag = False
tags_to_apply_list = []
#tags_to_apply_list = [ "locally_implemented_hard_news" ]

print( "Filter params:" )
print( "- date_range_start: {}".format( start_date ) )
print( "- date_range_end: {}".format( end_date ) )
print( "- newspaper: {}".format( selected_newspaper ) )
print( "- local_news_sections: {}".format( section_name_in_list ) )
print( "- custom_article_q: {}".format( custom_article_q ) )
print( "" )

#==============================================================================#
# ! filtering
#==============================================================================#

# start with all articles
article_qs = Article.objects.all()

# ! ==> call Article.filter_articles()
article_qs = Article.filter_articles( qs_IN = article_qs,
                                      start_date = start_date,
                                      end_date = end_date,
                                      newspaper = selected_newspaper,
                                      section_name_list = section_name_in_list,
                                      custom_article_q = custom_article_q )

# how many is that?
article_count = article_qs.count()
print( "Article count before filtering on Article IDs: " + str( article_count ) )

# ! ==> article IDs in list
if ( ( article_id_in_list is not None ) and ( len( article_id_in_list ) > 0 ) ):

    # include those in a list
    print( "filtering articles to those with IDs: " + str( article_id_in_list ) )
    article_qs = article_qs.filter( id__in = article_id_in_list )

#-- END check to see if we have a specific list of article IDs we want to include --#

# how many is that?
article_count = article_qs.count()
print( "Article count before filtering on tags: " + str( article_count ) )

# ! ==> tags - exclude

# tags to exclude
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "exclude-ing articles with tags: " + str( tags_not_in_list ) )
    article_qs = article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# ! ==> tags - include only those with certain tags.
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

    # filter
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    article_qs = article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# ! ==> filter out "*prelim*" tags?
if ( filter_out_prelim_tags == True ):

    # filter out all articles with any tag whose name contains "prelim".
    print( "filtering out articles with tags that contain \"prelim\"" )
    article_qs = article_qs.exclude( tags__name__icontains = "prelim" )
    
#-- END check to see if we filter out "prelim_*" tags --#

# how many is that?
article_count = article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

# just want un-cleaned-up:
article_qs = article_qs.filter( Q( cleanup_status = Article.CLEANUP_STATUS_NEW ) | Q( cleanup_status__isnull = True ) )
article_count = article_qs.count()

print( "Article count after filtering on cleanup_status - filter out any that have already been cleaned up: " + str( article_count ) )

# ! ==> ORDER BY - do we want a random sample?
#random_count = 10
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    article_qs = article_qs.order_by( "?" )[ : random_count ]

else:

    # order by ID (can't re-order once a slice is taken, so can't re-order if
    #     random sample).
    article_qs = article_qs.order_by( "id" )

#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/

# how many is that?
article_count = article_qs.count()
print( "Article count after ORDER-ing: " + str( article_count ) )

# ! ==> LIMIT
# limit_count = 1
if ( limit_count > 0 ):

    article_qs = article_qs[ : limit_count ]
    
#-- END check to see if we limit. --#

# how many is that?
article_count = article_qs.count()
print( "Article count after LIMIT-ing: " + str( article_count ) )

#==============================================================================#
# ! processing - apply tags, capture author info
#==============================================================================#

# loop over articles?
if ( ( do_apply_tag == True ) or ( do_capture_author_info == True ) ):

    # capture author information
    processing_counter = 0
    processing_id_list = []
    processing_section_list = []
    for current_article in article_qs:
    
        # increment counter
        processing_counter += 1
        
        # update auditing information
        processing_id_list.append( current_article.id )
        if ( current_article.section not in processing_section_list ):

            # add hitherto unseen section to list.
            processing_section_list.append( current_article.section )
            
        #-- END check to see if we've already captured this section. --#
        
        print( "\nArticle " + str( processing_counter ) + " of " + str( article_count ) + ": " + str( current_article ) )

        # ! ==> capture_author_info()
        
        # are we capturing author info?
        if ( do_capture_author_info ):
    
            print( "- Attempting to capture author information from body of article.")
    
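            # call capture_author_info().  NOTE: `tdn` is not defined in this
            #     cell; it must be created (presumably as a DTNB instance)
            #     before do_capture_author_info is set to True.
            capture_status = tdn.capture_author_info( current_article )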
            capture_status_list = capture_status.get_message_list()
            
            # output status list
            print( "====> " + str( processing_counter ) + " - Article ID: " + str( current_article.id ) + "; capture_status = " + str( capture_status ) )
            
        #-- END check to see if we try to capture author info --#
        
        # ! ==> apply tags
        
        # apply tag(s) while we are at it?
        if ( ( do_apply_tag == True ) and ( tags_to_apply_list is not None ) and ( len( tags_to_apply_list ) > 0 ) ):
        
            print( "- Applying tags " + str( tags_to_apply_list ) + " to article.")

            # yes, please.  Loop over tags list.
            for tag_to_apply in tags_to_apply_list:
            
                # tag the article with each tag in the list.
                current_article.tags.add( tag_to_apply )
                
                print( "====> Applied tag \"" + tag_to_apply + "\"." )
                
            #-- END loop over tag list. --#
            
            print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
            
        #-- END check to see if we apply tag. --#
        
    #-- END loop over articles to capture author info. --#
    
    print( "\n" )
    print( "Processed " + str( processing_counter ) + " filtered article IDs: " + str( sorted( processing_id_list ) ) + " in the following sections of the paper: " + str( processing_section_list ) )

#-- END check to see if there is work to do. --#

Article count before filtering on Article IDs: 13070
Article count before filtering on tags: 13070
Article count after tag filtering: 13070
Article count after filtering on cleanup_status - filter out any that have already been cleaned up: 13036
Article count after ORDER-ing: 13036
Article count after LIMIT-ing: 13036
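
The counts above come from applying the filters in stages: tag exclusion, tag inclusion, then the `cleanup_status` check. The sketch below mimics that sequence on plain dictionaries so the logic can be followed without a database. It is an illustration only: the function `filter_records`, the record layout, and the status value `"new"` are assumptions made for this example (the actual value of `Article.CLEANUP_STATUS_NEW` is not shown in this notebook).

```python
def filter_records( records, tags_in = None, tags_not_in = None, only_uncleaned = False ):
    """Apply the staged filters to a list of dicts; return ( kept, counts )."""
    counts = []
    kept = list( records )
    counts.append( ( "start", len( kept ) ) )

    # exclude records that carry any unwanted tag.
    if tags_not_in:
        kept = [ r for r in kept if not ( set( r[ "tags" ] ) & set( tags_not_in ) ) ]

    # keep only records that carry at least one wanted tag.
    if tags_in:
        kept = [ r for r in kept if set( r[ "tags" ] ) & set( tags_in ) ]
    counts.append( ( "after tags", len( kept ) ) )

    # keep only records not yet cleaned up (status "new" or missing).
    if only_uncleaned:
        kept = [ r for r in kept if r.get( "cleanup_status" ) in ( None, "new" ) ]
    counts.append( ( "after cleanup_status", len( kept ) ) )

    return kept, counts


# hypothetical records standing in for Article rows.
records = [
    { "id": 1, "tags": [ "grp_month" ], "cleanup_status": None },
    { "id": 2, "tags": [ "prelim_network" ], "cleanup_status": "new" },
    { "id": 3, "tags": [], "cleanup_status": "complete" },
    { "id": 4, "tags": [ "prelim_reliability", "grp_month" ], "cleanup_status": "new" },
]
kept, counts = filter_records( records, tags_not_in = [ "prelim_network" ], only_uncleaned = True )
for label, count in counts:
    print( "{}: {}".format( label, count ) )
```

Printing a count after each stage, as the notebook cell does, makes it easy to see which filter removed which articles.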


### Data Creation - TDN - local bylines

In [None]:
#==============================================================================#
# ! imports
#==============================================================================#

# Django query object for OR-ing selection criteria together.
from django.db.models import Q

# imports - python_utilities
from python_utilities.logging.logging_helper import LoggingHelper

# imports - context_text
from context_text.collectors.newsbank.newspapers.DTNB import DTNB
from context_text.models import Article
from context_text.models import Article_Text
from context_text.models import Newspaper

#==============================================================================#
# ! declare variables
#==============================================================================#
selected_newspaper = None
article_qs = None
article_count = -1
article_counter = -1

# declare variables - filtering
start_date = ""
end_date = ""
local_news_sections = []
section_name_in_list = []
custom_article_q = None
affiliation = ""
article_id_list = []
article_id_in_list = []
tags_in_list = []
tags_not_in_list = []
filter_out_prelim_tags = False
random_count = -1
limit_count = -1
article_counter = -1

# declare variables - capturing author info.
do_capture_author_info = False
processing_counter = -1
processing_id_list = []
processing_section_list = []

# declare variables - size of random sample we want
#random_count = 60

# declare variables - also, apply tag?
do_apply_tag = False
tags_to_apply_list = []

# declare variables - details on author string anomalies
anomaly_detail = {}
author_anomaly_article_id = -1
author_anomaly_author_string = ""
author_anomaly_graf_1 = ""
author_anomaly_graf_2 = ""
anomaly_detail_string = ""

#==============================================================================#
# ! configure filters
#==============================================================================#

# ==> Article.filter_articles()
# date range, newspaper, section list, and custom Q().
start_date = ""
end_date = ""
selected_newspaper = None
section_name_in_list = None
custom_article_q = None

# month of local news from Detroit News from 2009-12-01 to 2009-12-31
#start_date = "2009-12-01"
#end_date = "2009-12-31"
selected_newspaper = Newspaper.objects.get( id = 2 ) # Detroit News

# limit to "local, regional and state news" sections.
section_name_in_list = DTNB.NEWS_SECTION_NAME_LIST

# limit to staff reporters.
custom_article_q = DTNB.Q_IN_HOUSE_AUTHOR

# ==> article IDs - include
#article_id_in_list = None

# ==> tags - exclude
#tags_not_in_list = [ "prelim_reliability", "prelim_network" ]
tags_not_in_list = None

# ==> tags - include only those with certain tags.
#tags_in_list = [ "prelim_reliability", "prelim_network" ]
#tags_in_list = [ "prelim_reliability", ]
tags_in_list = None

# ==> filter out "*prelim*" tags?
filter_out_prelim_tags = False

# ==> ORDER BY - do we want a random sample?
#random_count = 10
random_count = -1

#==============================================================================#
# ! configure processing
#==============================================================================#

do_capture_author_info = False
do_apply_tag = False
tags_to_apply_list = []
#tags_to_apply_list = [ "locally_implemented_hard_news" ]

#==============================================================================#
# ! filtering
#==============================================================================#

# start with all articles
article_qs = Article.objects.all()

# ! ==> call Article.filter_articles()
article_qs = Article.filter_articles( qs_IN = article_qs,
                                      start_date = start_date,
                                      end_date = end_date,
                                      newspaper = selected_newspaper,
                                      section_name_list = section_name_in_list,
                                      custom_article_q = custom_article_q )

# how many is that?
article_count = article_qs.count()
print( "Article count before filtering on Article IDs: " + str( article_count ) )

# ! ==> article IDs in list
if ( ( article_id_in_list is not None ) and ( len( article_id_in_list ) > 0 ) ):

    # include those in a list
    print( "filtering articles to those with IDs: " + str( article_id_in_list ) )
    article_qs = article_qs.filter( id__in = article_id_in_list )

#-- END check to see if we have a specific list of article IDs we want to include --#

# how many is that?
article_count = article_qs.count()
print( "Article count before filtering on tags: " + str( article_count ) )

# ! ==> tags - exclude

# tags to exclude
if ( ( tags_not_in_list is not None ) and ( len( tags_not_in_list ) > 0 ) ):

    # exclude those in a list
    print( "excluding articles with tags: " + str( tags_not_in_list ) )
    article_qs = article_qs.exclude( tags__name__in = tags_not_in_list )

#-- END check to see if we have a specific list of tags we want to exclude --#

# ! ==> tags - include only those with certain tags.
if ( ( tags_in_list is not None ) and ( len( tags_in_list ) > 0 ) ):

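    # filter - note: because this joins across tags, an article that carries
    #     more than one of the listed tags can appear more than once in the
    #     result, which would inflate the counts printed below.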
    print( "filtering to just articles with tags: " + str( tags_in_list ) )
    article_qs = article_qs.filter( tags__name__in = tags_in_list )
    
#-- END check to see if we have a specific list of tags we want to include --#

# ! ==> filter out "*prelim*" tags?
if ( filter_out_prelim_tags == True ):

    # filter out all articles with any tag whose name contains "prelim".
    print( "filtering out articles with tags that contain \"prelim\"" )
    article_qs = article_qs.exclude( tags__name__icontains = "prelim" )
    
#-- END check to see if we filter out "prelim_*" tags --#

# how many is that?
article_count = article_qs.count()

print( "Article count after tag filtering: " + str( article_count ) )

# just want un-cleaned-up:
article_qs = article_qs.filter( Q( cleanup_status = Article.CLEANUP_STATUS_NEW ) | Q( cleanup_status__isnull = True ) )
article_count = article_qs.count()

print( "Article count after filtering on cleanup_status - filter out any that have already been cleaned up: " + str( article_count ) )

# ! ==> ORDER BY - do we want a random sample?
#random_count = 10
if ( random_count > 0 ):

    # to get random, order them by "?", then use slicing to retrieve requested
    #     number.
    article_qs = article_qs.order_by( "?" )[ : random_count ]

else:

    # order by ID (can't re-order once a slice is taken, so can't re-order if
    #     random sample).
    article_qs = article_qs.order_by( "id" )

#-- END check to see if we want random sample --#

# this is a nice algorithm, also:
# - http://www.titov.net/2005/09/21/do-not-use-order-by-rand-or-how-to-get-random-rows-from-table/
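
# Sketch only, not part of the original workflow: a queryset ordered by "?" is
#     re-randomized each time it is sent to the database, so a count and a
#     later loop may not see the same articles.  One way around that, assuming
#     the list of candidate IDs fits in memory, is to pick the sample of IDs
#     in Python and then filter on those IDs.  The function name and its
#     parameters are hypothetical, made up for this illustration.
import random

def pick_stable_sample( id_list_IN, sample_size_IN, seed_IN = None ):

    # return a sorted list of up to sample_size_IN unique IDs from id_list_IN.
    #     Passing the same seed_IN yields the same sample on every call.
    sample_id_list = []
    unique_id_list = sorted( set( id_list_IN ) )
    rng = random.Random( seed_IN )

    if ( sample_size_IN <= 0 ):

        # nothing requested, return empty list.
        sample_id_list = []

    elif ( sample_size_IN >= len( unique_id_list ) ):

        # asked for at least as many as we have, so return them all.
        sample_id_list = unique_id_list

    else:

        # sample without replacement, then sort for readable output.
        sample_id_list = sorted( rng.sample( unique_id_list, sample_size_IN ) )

    #-- END check to see how many IDs were requested --#

    return sample_id_list

#-- END function pick_stable_sample() --#

# possible usage, applied to the un-sliced queryset (left commented out so the
#     cell behaves as before):
#sample_id_list = pick_stable_sample( article_qs.values_list( "id", flat = True ), 10, seed_IN = 42 )
#article_qs = article_qs.filter( id__in = sample_id_list ).order_by( "id" )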

# how many is that?
article_count = article_qs.count()
print( "Article count after ORDER-ing: " + str( article_count ) )

# ! ==> LIMIT
# limit_count = 1
if ( limit_count > 0 ):

    article_qs = article_qs[ : limit_count ]
    
#-- END check to see if we limit. --#

# how many is that?
article_count = article_qs.count()
print( "Article count after LIMIT-ing: " + str( article_count ) )

#==============================================================================#
# ! analyze_author_info()
#==============================================================================#

# create instance of DTNB.
tdn = DTNB()

# run analysis.
tdn.analyze_author_info( article_qs )

# output details.
tdn.output_debug_message( "========================================" )
tdn.output_debug_message( "Found " + str( tdn.article_counter ) + " articles ( " + str( tdn.article_count ) + " )." )

# name in 1st paragraph?
tdn.output_debug_message( "\n" )
tdn.output_debug_message( "Found " + str( len( tdn.graf_1_has_name_id_list ) ) + " ( " + str( tdn.graf_1_has_name_count ) + " ) articles WITH name in 1st graf." )
#tdn.output_debug_message( "----> Sections: " + str( sorted( tdn.graf_1_has_name_section_list ) ) )
#tdn.output_debug_message( "----> IDs: " + str( sorted( tdn.graf_1_has_name_id_list ) ) )
tdn.output_debug_message( "" )
tdn.output_debug_message( "Found " + str( len( tdn.graf_1_no_name_id_list ) ) + " articles WITHOUT name in 1st graf." )
tdn.output_debug_message( "----> Sections: " + str( sorted( tdn.graf_1_no_name_section_list ) ) )
tdn.output_debug_message( "----> IDs (sorted): " + str( sorted( tdn.graf_1_no_name_id_list ) ) )
tdn.output_debug_message( "----> IDs: " + str( tdn.graf_1_no_name_id_list ) )
tdn.output_debug_message( "----> graf text: " )
tdn.output_debug_message( "----> NO NAME 1st GRAF anomaly details: " )

# output author name anomaly details.
for anomaly_detail in tdn.graf_1_no_name_detail_list:

    # get anomaly details
    author_anomaly_article_id = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_ARTICLE_ID, -1 )
    author_anomaly_author_string = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_AUTHOR_STRING, "" )
    author_anomaly_graf_1 = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_1, "" )
    author_anomaly_graf_2 = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_2, "" )
    
    # output them.
    anomaly_detail_string = "--------> article ID: " + str( author_anomaly_article_id )
    anomaly_detail_string += "\n- author_string: " + str( author_anomaly_author_string )
    anomaly_detail_string += "\n- graf 1: " + str( author_anomaly_graf_1 )
    anomaly_detail_string += "\n- graf 2: " + str( author_anomaly_graf_2 )
    anomaly_detail_string += "\n"
    tdn.output_debug_message( anomaly_detail_string )

#-- END loop over author anomaly details --#
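
# Sketch only: the anomaly loops in this cell build the same four-line string
#     from a detail dictionary.  A small helper like the one below could do
#     that in one place.  The function name and its key parameters are
#     hypothetical; the keys would be the DTNB.AUTHOR_ANOMALY_DETAIL_*
#     constants used in the loops.
def format_anomaly_detail( detail_IN, id_key_IN, author_key_IN, graf_1_key_IN, graf_2_key_IN ):

    # pull values, using the same defaults as the loops in this cell.
    article_id = detail_IN.get( id_key_IN, -1 )
    author_string = detail_IN.get( author_key_IN, "" )
    graf_1 = detail_IN.get( graf_1_key_IN, "" )
    graf_2 = detail_IN.get( graf_2_key_IN, "" )

    # build output string - str() guards against None values.
    detail_string = "--------> article ID: " + str( article_id )
    detail_string += "\n- author_string: " + str( author_string )
    detail_string += "\n- graf 1: " + str( graf_1 )
    detail_string += "\n- graf 2: " + str( graf_2 )
    detail_string += "\n"

    return detail_string

#-- END function format_anomaly_detail() --#

# possible usage inside one of the loops (left commented out):
#tdn.output_debug_message( format_anomaly_detail( anomaly_detail, DTNB.AUTHOR_ANOMALY_DETAIL_ARTICLE_ID, DTNB.AUTHOR_ANOMALY_DETAIL_AUTHOR_STRING, DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_1, DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_2 ) )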

tdn.output_debug_message( "----> got name?: " + str( tdn.graf_1_no_name_yes_author_count ) )

# "by" in 1st paragraph?
tdn.output_debug_message( "\n" )
tdn.output_debug_message( "Found " + str( len( tdn.graf_1_has_by_id_list ) ) + " ( " + str( tdn.graf_1_has_by_count ) + " ) articles WITH \"by\" in 1st graf." )
#tdn.output_debug_message( "----> Sections: " + str( sorted( tdn.graf_1_has_by_section_list ) ) )
#tdn.output_debug_message( "----> IDs: " + str( sorted( tdn.graf_1_has_by_id_list ) ) )
tdn.output_debug_message( "" )
tdn.output_debug_message( "Found " + str( len( tdn.graf_1_no_by_id_list ) ) + " articles WITHOUT \"by\" in 1st graf." )
tdn.output_debug_message( "----> Sections: " + str( sorted( tdn.graf_1_no_by_section_list ) ) )
tdn.output_debug_message( "----> IDs (sorted): " + str( sorted( tdn.graf_1_no_by_id_list ) ) )
tdn.output_debug_message( "----> IDs: " + str( tdn.graf_1_no_by_id_list ) )
tdn.output_debug_message( "----> graf text: " )
tdn.output_debug_message( "----> NO BY 1st GRAF anomaly details: " )

# output "by" anomaly details.
for anomaly_detail in tdn.graf_1_no_by_detail_list:

    # get anomaly details
    author_anomaly_article_id = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_ARTICLE_ID, -1 )
    author_anomaly_author_string = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_AUTHOR_STRING, "" )
    author_anomaly_graf_1 = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_1, "" )
    author_anomaly_graf_2 = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_2, "" )
    
    # output them.
    anomaly_detail_string = "--------> article ID: " + str( author_anomaly_article_id )
    anomaly_detail_string += "\n- author_string: " + str( author_anomaly_author_string )
    anomaly_detail_string += "\n- graf 1: " + str( author_anomaly_graf_1 )
    anomaly_detail_string += "\n- graf 2: " + str( author_anomaly_graf_2 )
    anomaly_detail_string += "\n"
    tdn.output_debug_message( anomaly_detail_string )

#-- END loop over author anomaly details --#

tdn.output_debug_message( "----> got name?: " + str( tdn.graf_1_no_by_yes_author_count ) )

# "detroit news" in 2nd paragraph?
tdn.output_debug_message( "\n" )
tdn.output_debug_message( "Found " + str( len( tdn.graf_2_has_DN_id_list ) ) + " ( " + str( tdn.graf_2_has_DN_count ) + " ) articles WITH \"detroit news\" in 2nd graf." )
#tdn.output_debug_message( "----> Sections: " + str( sorted( tdn.graf_2_has_DN_section_list ) ) )
#tdn.output_debug_message( "----> IDs: " + str( sorted( tdn.graf_2_has_DN_id_list ) ) )
tdn.output_debug_message( "" )
tdn.output_debug_message( "Found " + str( len( tdn.graf_2_no_DN_id_list ) ) + " articles WITHOUT \"detroit news\" in 2nd graf." )
tdn.output_debug_message( "----> Sections: " + str( sorted( tdn.graf_2_no_DN_section_list ) ) )
tdn.output_debug_message( "----> IDs: " + str( sorted( tdn.graf_2_no_DN_id_list ) ) )
tdn.output_debug_message( "----> graf text: " )
tdn.output_debug_message( "----> NO \"detroit news\" 2nd GRAF anomaly details: " )

# output "detroit news" anomaly details.
for anomaly_detail in tdn.graf_2_no_DN_detail_list:

    # get anomaly details
    author_anomaly_article_id = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_ARTICLE_ID, -1 )
    author_anomaly_author_string = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_AUTHOR_STRING, "" )
    author_anomaly_graf_1 = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_1, "" )
    author_anomaly_graf_2 = anomaly_detail.get( DTNB.AUTHOR_ANOMALY_DETAIL_GRAF_2, "" )
    
    # output them.
    anomaly_detail_string = "--------> article ID: " + str( author_anomaly_article_id )
    anomaly_detail_string += "\n- author_string: " + str( author_anomaly_author_string )
    anomaly_detail_string += "\n- graf 1: " + str( author_anomaly_graf_1 )
    anomaly_detail_string += "\n- graf 2: " + str( author_anomaly_graf_2 )
    anomaly_detail_string += "\n"
    tdn.output_debug_message( anomaly_detail_string )

#-- END loop over author anomaly details --#

# all article IDs in set.
#tdn.output_debug_message( "\n" )
#tdn.output_debug_message( "List of " + str( len( tdn.article_id_list ) ) + " filtered article IDs: " + str( sorted( tdn.article_id_list ) ) )

#==============================================================================#
# ! processing - apply tags, capture author info
#==============================================================================#

# loop over articles?
if ( ( do_apply_tag == True ) or ( do_capture_author_info == True ) ):

    # capture author information
    processing_counter = 0
    processing_id_list = []
    processing_section_list = []
    for current_article in article_qs:
    
        # increment counter
        processing_counter += 1
        
        # update auditing information
        processing_id_list.append( current_article.id )
        if ( current_article.section not in processing_section_list ):

            # add hitherto unseen section to list.
            processing_section_list.append( current_article.section )
            
        #-- END check to see if we've already captured this section. --#
        
        print( "\nArticle " + str( processing_counter ) + " of " + str( article_count ) + ": " + str( current_article ) )

        # ! ==> capture_author_info()
        
        # are we capturing author info?
        if ( do_capture_author_info ):
    
            print( "- Attempting to capture author information from body of article.")
    
            # call capture_author_info()
            capture_status = tdn.capture_author_info( current_article )
            capture_status_list = capture_status.get_message_list()
            
            # output status list
            print( "====> " + str( processing_counter ) + " - Article ID: " + str( current_article.id ) + "; capture_status = " + str( capture_status ) )
            
        #-- END check to see if we try to capture author info --#
        
        # ! ==> apply tags
        
        # apply tag(s) while we are at it?
        if ( ( do_apply_tag == True ) and ( tags_to_apply_list is not None ) and ( len( tags_to_apply_list ) > 0 ) ):
        
            print( "- Applying tags " + str( tags_to_apply_list ) + " to article.")

            # yes, please.  Loop over tags list.
            for tag_to_apply in tags_to_apply_list:
            
                # tag the article with each tag in the list.
                current_article.tags.add( tag_to_apply )
                
                print( "====> Applied tag \"" + tag_to_apply + "\"." )
                
            #-- END loop over tag list. --#
            
            print( "- Tags for article " + str( current_article.id ) + " : " + str( current_article.tags.all() ) )
            
        #-- END check to see if we apply tag. --#
        
    #-- END loop over articles to capture author info. --#
    
    print( "\n" )
    print( "Processed " + str( processing_counter ) + " filtered article IDs: " + str( sorted( processing_id_list ) ) + " in the following sections of the paper: " + str( processing_section_list ) )

#-- END check to see if there is work to do. --#