**analysis-network_data_output-GRP.ipynb - Programmatic network data output**

# Notes

- For year comparisons, limit the person query to the years being compared rather than all time (an all-time person list creates really large matrices).
- To start, make matrices for the year before and the year after the layoffs, with the person query covering both years.

    - store them in the vm-share folder, mounted in the VM.
    - see how large they are.
    - then, try loading them in R and computing a correlation. If that works with relatively giant matrices on glassbox, great. If not, we will need to try edge lists (I suspect the real network packages don't store the whole matrix; they store an edge list plus a list of the nodes).
    
- Need to move to edge lists: the same edge list can then be used with different node lists, as long as every node in the edge list is also in the node list.
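
As a rough illustration of the edge-list idea in the notes above, here is a minimal sketch that converts a small square matrix into a node list plus an edge list. The helper name `matrix_to_edge_list` and the toy data are hypothetical and not part of `NetworkOutput`; the real matrices also carry attribute columns that this sketch ignores.

```python
def matrix_to_edge_list(matrix, node_ids):
    """Return (node_list, edge_list) for a square adjacency matrix.

    Hypothetical helper: `matrix` is a list of rows of numeric tie weights,
    and `node_ids` holds the ID for each row/column position.
    """
    if len(matrix) != len(node_ids) or any(len(row) != len(node_ids) for row in matrix):
        raise ValueError("matrix must be square and match the length of node_ids")

    edge_list = []
    for row_index, row in enumerate(matrix):
        for col_index, weight in enumerate(row):
            # only non-zero cells become edges, so sparse networks stay small.
            if weight != 0:
                edge_list.append((node_ids[row_index], node_ids[col_index], weight))

    return list(node_ids), edge_list


# toy 3-person network: persons 11 and 12 share a tie, 13 is an isolate.
toy_matrix = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
toy_nodes, toy_edges = matrix_to_edge_list(toy_matrix, [11, 12, 13])
print(toy_nodes)
print(toy_edges)
```

The isolate (13) appears only in the node list, which is why the node list has to be kept alongside the edge list.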
    

# Setup

## Setup - Debug

In [1]:
debug_flag = False

## Setup - Imports

In [2]:
# python base imports
import copy
import csv
import datetime
import hashlib
import json
import logging

# import six
import six

print( "packages imported at " + str( datetime.datetime.now() ) )

packages imported at 2022-06-04 01:33:57.299822


## Setup - working folder paths

In [3]:
%pwd

'/home/jonathanmorgan/work/django/research/research/work/phd_work/analysis/network_data'

In [4]:
# current working folder
project_name = "research"
project_base_folder = "/home/jonathanmorgan/work/django/{project_name}".format( project_name = project_name )
django_project_folder = "{base_folder}/{project_name}".format(
    base_folder = project_base_folder,
    project_name = project_name
)
current_working_folder = "{django_project_folder}/work/phd_work/analysis/network_data".format(
    django_project_folder = django_project_folder
)
current_datetime = datetime.datetime.now()
current_date_string = current_datetime.strftime( "%Y-%m-%d-%H-%M-%S" )

# and, output path.
#network_data_output_folder_path = "/media/psf/phd_work/network_data"
network_data_output_folder_path = "/home/jonathanmorgan/shares/phd_work/network_data"

## Setup - logging

Configure logging for this notebook's kernel. (If you do not run this cell, you'll get the django application's logging configuration.)

In [5]:
# build file name
project_log_folder = "{base_folder}/logs".format( base_folder = project_base_folder )
logging_file_name = "{}/network_data_output-GRP-{}.log.txt".format( project_log_folder, current_date_string )

# set up logging.
logging.basicConfig(
    level = logging.DEBUG,
    format = '%(asctime)s - %(levelname)s - %(name)s - %(message)s',
    filename = logging_file_name,
    filemode = 'w' # set to 'a' if you want to append, rather than overwrite each time.
)

## Setup - Initialize Django

First, initialize my dev django project, so I can run code in this notebook that references my django models and can talk to the database using my project's settings.

In [6]:
# init django
django_init_folder = "{django_project_folder}/work/phd_work".format(
    django_project_folder = django_project_folder
)
django_init_path = "django_init.py"
if( ( django_init_folder is not None ) and ( django_init_folder != "" ) ):
    
    # add folder to front of path.
    django_init_path = "{}/{}".format( django_init_folder, django_init_path )
    
#-- END check to see if django_init folder. --#

In [7]:
%run $django_init_path

django initialized at 2022-06-04 01:34:07.871157


### Setup - django-related imports

In [8]:
# python utilities
from python_utilities.strings.string_helper import StringHelper

# import class that actually processes requests for outputting networks.
from context_text.export.network_output import NetworkOutput

print( "django model packages imported at " + str( datetime.datetime.now() ) )

django model packages imported at 2022-06-04 01:34:08.701173


## Setup - functions

### Setup - function `make_string_hash()`

In [9]:
def make_string_hash( value_IN, hash_function_IN = hashlib.sha256 ):

    # return reference
    value_OUT = None

    # declare variables
    me = "make_string_hash"

    # call StringHelper method.
    value_OUT = StringHelper.make_string_hash( value_IN, hash_function_IN = hash_function_IN )

    return value_OUT

#-- END function make_string_hash() --#

print( "function make_string_hash() defined at " + str( datetime.datetime.now() ) )

function make_string_hash() defined at 2022-06-04 01:34:10.998848
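
`StringHelper.make_string_hash()` is not shown in this notebook. As a rough standard-library sketch of the same idea, assuming it hashes the UTF-8 bytes of the string and returns a hex digest (an assumption, not confirmed here), something like the following would produce a comparable fingerprint. The name `sketch_string_hash` is hypothetical.

```python
import hashlib


def sketch_string_hash(value, hash_function=hashlib.sha256, encoding="utf-8"):
    """Return the hex digest of a string (hypothetical stand-in for StringHelper)."""
    hasher = hash_function()
    # hash functions work on bytes, so encode the string first.
    hasher.update(value.encode(encoding))
    return hasher.hexdigest()


demo_hash = sketch_string_hash("abc")
print(demo_hash)
```

Digests from this sketch will only match the `should_be` values later in the notebook if the assumptions about encoding and output format hold.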


### Setup - function `create_pre_post_networks()`

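Creates a pair of networks around a single base date: a "pre" network covering the `network_timedelta_IN` before the date, and a "post" network that starts on the date and covers the same length of time. The person query spans both periods. Accepts:

- `date_IN` - base date (`datetime.date`) that separates the "pre" and "post" periods.
- `data_spec_IN` - base network data spec (dict). It is deep-copied, so the caller's spec is not changed.
- `network_timedelta_IN` - `datetime.timedelta` length of each network's time period.
- `label_prefix_IN` - prefix for the network labels, which are built as `{prefix}_{base_date}_pre` and `{prefix}_{base_date}_post`.
- `output_folder_path_IN` - folder where the network data files are saved.
- `do_create_IN` - if `True`, actually create the networks. If `False` (the default), only print the date ranges.
- `debug_flag_IN` - if `True`, print start time, end time, and duration for each network.
- `network_outputter_IN` - optional `NetworkOutput` instance. One is created if none is passed in. Note that when `do_create_IN` is `True`, a fresh `NetworkOutput` is currently created for each network.
- `id_IN` - optional identifier included in the status message.

Returns the last `NetworkOutput` instance used. The function relies on the notebook-level `one_day_delta` defined in "Setup - shared datetime instances", so that cell must be run before calling it.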



In [34]:
def create_pre_post_networks(
    date_IN = None,
    data_spec_IN = None,
    network_timedelta_IN = None,
    label_prefix_IN = None,
    output_folder_path_IN = None,
    do_create_IN = False,
    debug_flag_IN = False,
    network_outputter_IN = None,
    id_IN = None
):
    
    # return reference
    outputter_OUT = None
    
    # declare variables
    me = "create_pre_post_networks"
    status_message = None
    my_debug_flag = None
    my_data_spec_json = None
    do_create = None
    network_outputter = None

    # program control
    network_timedelta = None
    output_folder_path = None

    # declare variables - loop processing.
    pre_start_date = None
    pre_end_date = None
    pre_label = None
    post_start_date = None
    post_end_date = None
    post_label = None
    label_prefix = None

    #--------------------------------------------------------------------------#
    # initialize
    
    # init - from params
    base_date = date_IN
    my_data_spec_json = copy.deepcopy( data_spec_IN )
    network_timedelta = network_timedelta_IN
    label_prefix = label_prefix_IN
    output_folder_path = output_folder_path_IN
    do_create = do_create_IN
    my_debug_flag = debug_flag_IN
    network_outputter = network_outputter_IN
    
    # got a network outputter?
    if ( network_outputter is None ):
        
        # create one.
        network_outputter = NetworkOutput()
        
    #-- END check if network outputter --#

    # set output folder
    my_data_spec_json[ NetworkOutput.PARAM_NAME_SAVE_DATA_IN_FOLDER ] = output_folder_path

    # set up start and end dates for pre- and post-networks.
    pre_start_date = base_date - network_timedelta
    pre_end_date = base_date - one_day_delta
    post_start_date = base_date
    post_end_date = base_date + network_timedelta - one_day_delta
    
    # print status message
    status_message = "==> current time range"
    if ( id_IN is not None ):
        status_message = "{message} ( {index} )".format(
            message = status_message,
            index = id_IN
        )
    #-- END check if ID passed in --#
    status_message = "{message}: {pre_start} - {pre_end}; {post_start} - {post_end}".format(
        message = status_message,
        pre_start = pre_start_date,
        pre_end = pre_end_date,
        post_start = post_start_date,
        post_end = post_end_date
    )
    print( status_message )

    if ( do_create == True ):

        # create label prefix
        label_prefix = "{prefix}_{base_date}_".format(
            prefix = label_prefix,
            base_date = base_date
        )
        pre_label = "{}pre".format( label_prefix )
        post_label = "{}post".format( label_prefix )

        # set person query start and end dates.
        my_data_spec_json[ NetworkOutput.PARAM_PERSON_START_DATE ] = pre_start_date.isoformat()
        my_data_spec_json[ NetworkOutput.PARAM_PERSON_END_DATE ] = post_end_date.isoformat()

        #------------------------------------------------------------------#
        # ==> pre

        # update data creation spec.
        my_data_spec_json[ NetworkOutput.PARAM_START_DATE ] = pre_start_date.isoformat()
        my_data_spec_json[ NetworkOutput.PARAM_END_DATE ] = pre_end_date.isoformat()
        my_data_spec_json[ NetworkOutput.PARAM_NETWORK_LABEL ] = pre_label

        if ( my_debug_flag == True ):
            # make and output pre-network
            start_dt = datetime.datetime.now()
            print( "----> pre - starting network creation at {}".format( start_dt ) )
        #-- END DEBUG --#

        network_outputter = NetworkOutput()
        network_data = network_outputter.process_network_output_request(
            params_IN = my_data_spec_json,
            debug_flag_IN = False
        )

        if ( my_debug_flag == True ):
            # end time and duration
            end_dt = datetime.datetime.now()
            my_duration = end_dt - start_dt
            print( "----> pre - network creation complete at {}".format( end_dt ) )
            print( "--------> pre - duration: {}".format( my_duration ) )
        #-- END DEBUG --#

        #------------------------------------------------------------------#
        # ==> post

        # update data creation spec.
        my_data_spec_json[ NetworkOutput.PARAM_START_DATE ] = post_start_date.isoformat()
        my_data_spec_json[ NetworkOutput.PARAM_END_DATE ] = post_end_date.isoformat()
        my_data_spec_json[ NetworkOutput.PARAM_NETWORK_LABEL ] = post_label

        if ( my_debug_flag == True ):
            # make and output post-network
            start_dt = datetime.datetime.now()
            print( "----> post - starting network creation at {}".format( start_dt ) )
        #-- END DEBUG --#

        network_outputter = NetworkOutput()
        network_data = network_outputter.process_network_output_request(
            params_IN = my_data_spec_json,
            debug_flag_IN = False
        )

        if ( my_debug_flag == True ):
            # end time and duration
            end_dt = datetime.datetime.now()
            print( "----> post - network creation complete at {}".format( end_dt ) )
            my_duration = end_dt - start_dt
            print( "--------> post - duration: {}".format( my_duration ) )
        #-- END DEBUG --#

    #-- END check if we actually do the work --#
    
    outputter_OUT = network_outputter
    
    return outputter_OUT
    
#-- END function create_pre_post_networks --#

print( "function create_pre_post_networks() defined at " + str( datetime.datetime.now() ) )

function create_pre_post_networks() defined at 2022-06-04 02:16:05.026479
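
The date arithmetic inside `create_pre_post_networks()` can be checked on its own, without touching the database. The sketch below pulls it into a small helper (the name `pre_post_windows` is hypothetical) and runs it for the layoff date with a 365-day window.

```python
import datetime

ONE_DAY = datetime.timedelta(days=1)


def pre_post_windows(base_date, network_timedelta):
    """Return (pre_start, pre_end, post_start, post_end) around base_date."""
    pre_start = base_date - network_timedelta
    pre_end = base_date - ONE_DAY  # "pre" ends the day before the base date
    post_start = base_date  # "post" starts on the base date itself
    post_end = base_date + network_timedelta - ONE_DAY
    return pre_start, pre_end, post_start, post_end


layoff_windows = pre_post_windows(
    datetime.date(2010, 1, 8),
    datetime.timedelta(days=365),
)
print("{} - {}; {} - {}".format(*layoff_windows))
```

Both windows are inclusive of their end dates and do not overlap, so each article date falls in exactly one of the two networks.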


### Setup - function `create_pre_post_network_pairs()`

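Creates "pre"/"post" network pairs for a series of base dates by calling `create_pre_post_networks()` once per base date. Base dates start at `start_date_IN` plus `network_timedelta_IN`, end at `end_date_IN` minus `network_timedelta_IN` plus one day, and advance by `increment_timedelta_IN`. Accepts:

- `start_date_IN` - first date (`datetime.date`) with data.
- `end_date_IN` - last date (`datetime.date`) with data.
- `data_spec_IN` - base network data spec (dict), passed through to `create_pre_post_networks()`, which makes its own copy.
- `network_timedelta_IN` - `datetime.timedelta` length of each network's time period.
- `increment_timedelta_IN` - `datetime.timedelta` between successive base dates.
- `label_prefix_IN` - prefix for the network labels.
- `output_folder_path_IN` - folder where the network data files are saved.
- `do_create_IN` - if `True`, actually create the networks. If `False` (the default), only print the date ranges.
- `debug_flag_IN` - if `True`, print timing details for each network.
- `network_outputter_IN` - optional `NetworkOutput` instance. One is created if none is passed in.

Returns the `NetworkOutput` instance that was passed in or created. The function relies on the notebook-level `one_day_delta` defined in "Setup - shared datetime instances".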



In [30]:
def create_pre_post_network_pairs(
    start_date_IN = None,
    end_date_IN = None,
    data_spec_IN = None,
    network_timedelta_IN = None,
    increment_timedelta_IN = None,
    label_prefix_IN = None,
    output_folder_path_IN = None,
    do_create_IN = False,
    debug_flag_IN = False,
    network_outputter_IN = None
):
    
    # return reference
    outputter_OUT = None
    
    # declare variables
    me = "create_pre_post_network_pairs"
    status_message = None
    my_debug_flag = None
    start_base_date = None
    end_base_date = None
    my_data_spec_json = None
    do_create = None
    network_outputter = None

    # program control
    network_timedelta = None
    increment_timedelta = None  # how much time between base dates where we measure?
    output_folder_path = None

    # declare variables - loop processing.
    base_date = None
    time_period_index = None
    label_prefix = None

    #--------------------------------------------------------------------------#
    # initialize
    
    # init - from params
    #my_data_spec_json = copy.deepcopy( data_spec_IN )
    my_data_spec_json = data_spec_IN  # create_pre_post_networks() makes a copy
    network_timedelta = network_timedelta_IN
    increment_timedelta = increment_timedelta_IN
    label_prefix = label_prefix_IN
    output_folder_path = output_folder_path_IN
    do_create = do_create_IN
    my_debug_flag = debug_flag_IN
    network_outputter = network_outputter_IN
    
    # got a network outputter?
    if ( network_outputter is None ):
        
        # create one.
        network_outputter = NetworkOutput()
        
    #-- END check if network outputter --#

    # init - length of network slices, and start and end base dates.
    start_base_date = start_date_IN + network_timedelta
    #start_base_date = start_base_date + one_day_delta
    end_base_date = end_date_IN - network_timedelta
    end_base_date = end_base_date + one_day_delta
    
    print( "processing base dates from {} to {}".format( start_base_date, end_base_date ) )

    #--------------------------------------------------------------------------#
    # loop over base dates, creating previous year and current year matrices for each.
    base_date = start_base_date
    #end_base_date = start_base_date
    time_period_index = 0
    while base_date <= end_base_date:

        # increment index
        time_period_index += 1

        # call create_pre_post_networks()
        create_pre_post_networks(
            date_IN = base_date,
            data_spec_IN = my_data_spec_json,
            network_timedelta_IN = network_timedelta,
            label_prefix_IN = label_prefix,
            output_folder_path_IN = output_folder_path,
            do_create_IN = do_create,
            debug_flag_IN = my_debug_flag,
            network_outputter_IN = network_outputter,
            id_IN = time_period_index
        )

        # increment base date before starting loop again.
        #print( "increment_time_delta: {}".format( increment_timedelta ) )
        base_date = base_date + increment_timedelta

    #-- END loop over base dates --#
    
    outputter_OUT = network_outputter
    
    return outputter_OUT
    
#-- END function create_pre_post_network_pairs --#

print( "function create_pre_post_network_pairs() defined at " + str( datetime.datetime.now() ) )

function create_pre_post_network_pairs() defined at 2022-06-04 02:13:14.763792
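
To preview which base dates a run will visit before creating any networks, the loop bounds can be reproduced in a small generator. This is a sketch of the same arithmetic under a hypothetical name (`iter_base_dates`). It also rejects a non-positive increment, since the `while` loop above would never end with a zero increment.

```python
import datetime


def iter_base_dates(start_date, end_date, network_timedelta, increment_timedelta):
    """Yield each base date that leaves a full window before and after it."""
    one_day = datetime.timedelta(days=1)
    if increment_timedelta <= datetime.timedelta(0):
        raise ValueError("increment_timedelta must be positive")

    base_date = start_date + network_timedelta
    last_base_date = end_date - network_timedelta + one_day
    while base_date <= last_base_date:
        yield base_date
        base_date = base_date + increment_timedelta


# small example: one month of data, 10-day networks, base date every 5 days.
demo_base_dates = list(
    iter_base_dates(
        datetime.date(2005, 1, 1),
        datetime.date(2005, 1, 31),
        datetime.timedelta(days=10),
        datetime.timedelta(days=5),
    )
)
print(demo_base_dates)
```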


## Setup - base data spec

Network data spec that includes:

- `Article_Data` and `Person` queries use the same filters (person-query parameter names in parentheses)...:

    - _`coders` (`person_coders`)_: 2 (automated coder, id = 2)
    - coder type "OpenCalais_REST_API_v2"
    
        - _`coder_type_filter_type` (`person_coder_type_filter_type`)_: "automated"
        - _`coder_types_list` (`person_coder_types_list`)_: "OpenCalais_REST_API_v2"
    
    - _`publications` (`person_publications`)_: 1 (Grand Rapids Press)
    - dates: the base spec below covers 2005 only (the database holds articles from 2005-01-01 to 2010-11-30)
    
        - _`start_date` (`person_start_date`)_: "2005-01-01"
        - _`end_date` (`person_end_date`)_: "2005-12-31"
    
    - only articles tagged with `local_hard_news`.

        - _`tags_list` (`person_tags_list`)_: "local_hard_news"

- ...EXCEPT allowing duplicate articles for person so you get absolutely all persons, but not for `Article_Data` query.

    - _`person_allow_duplicate_articles`_: "yes"

- Network data creation options:

    - excludes persons with single word (no spaces) `verbatim_name`.
    
        - _`include_persons_with_single_word_name`_: "no"
    
    - exclude render details
        
        - _`network_include_render_details`_: "no"
        
    - output as tab-delimited matrix, with node attributes as additional columns on the far right of the square network part of the matrix.

        - _`output_type`_: "tab_delimited_matrix"
        - _`network_data_output_type`_: "net_and_attr_cols"

    - label - _`network_label`_: "all_grp_hard_news_2005"
    - include header row in the matrix output file.
    
        - _`network_include_headers`_: "yes"

    - output spec plus the resulting network data to the database, with label set to `network_label` plus a date-time string.
    
        - _`database_output`_: "yes"
        - _`db_add_timestamp_to_label`_: "yes"

_NOTE: only pass True to `network_outputter.process_network_output_request( debug_flag_IN )` if you really need to debug - it adds garbage data at the end of the output, even if you ask for no render details._


In [11]:
base_data_spec_json_string = """
    "start_date": "2005-01-01",
    "end_date": "2005-12-31",
    "date_range": "",
    "publications": "1",
    "coders": "2",
    "coder_id_priority_list": "",
    "coder_type_filter_type": "automated",
    "coder_types_list": "OpenCalais_REST_API_v2",
    "tags_list": "local_hard_news",
    "unique_identifiers": "",
    "allow_duplicate_articles": "no",
    "person_query_type": "custom",
    "person_start_date": "2005-01-01",
    "person_end_date": "2005-12-31",
    "person_date_range": "",
    "person_publications": "1",
    "person_coders": "2",
    "person_coder_id_priority_list": "",
    "person_coder_type_filter_type": "automated",
    "person_coder_types_list": "OpenCalais_REST_API_v2",
    "person_tags_list": "local_hard_news",
    "person_unique_identifiers": "",
    "person_allow_duplicate_articles": "yes",
    "include_source_contact_types": [
        "direct",
        "event",
        "past_quotes",
        "document",
        "other"
    ],
    "exclude_persons_with_tags_in_list": "",
    "include_persons_with_single_word_name": "no",
    "network_download_as_file": "no",
    "network_include_render_details": "no",
    "output_type": "tab_delimited_matrix",
    "network_data_output_type": "net_and_attr_cols",
    "network_label": "all_grp_hard_news_2005",
    "network_include_headers": "yes",
    "database_output": "yes",
    "db_add_timestamp_to_label": "yes",
    "db_save_data_in_database": "no",
    "save_data_in_folder": "{output_folder_path}"
"""

base_data_spec_json_string = base_data_spec_json_string.format(
    output_folder_path = network_data_output_folder_path
)
base_data_spec_json_string = "{left_curly}{json_properties}{right_curly}".format(
    left_curly = "{",
    json_properties = base_data_spec_json_string,
    right_curly = "}"
)

base_data_spec_json = json.loads( base_data_spec_json_string )
print( base_data_spec_json ) 

{'start_date': '2005-01-01', 'end_date': '2005-12-31', 'date_range': '', 'publications': '1', 'coders': '2', 'coder_id_priority_list': '', 'coder_type_filter_type': 'automated', 'coder_types_list': 'OpenCalais_REST_API_v2', 'tags_list': 'local_hard_news', 'unique_identifiers': '', 'allow_duplicate_articles': 'no', 'person_query_type': 'custom', 'person_start_date': '2005-01-01', 'person_end_date': '2005-12-31', 'person_date_range': '', 'person_publications': '1', 'person_coders': '2', 'person_coder_id_priority_list': '', 'person_coder_type_filter_type': 'automated', 'person_coder_types_list': 'OpenCalais_REST_API_v2', 'person_tags_list': 'local_hard_news', 'person_unique_identifiers': '', 'person_allow_duplicate_articles': 'yes', 'include_source_contact_types': ['direct', 'event', 'past_quotes', 'document', 'other'], 'exclude_persons_with_tags_in_list': '', 'include_persons_with_single_word_name': 'no', 'network_download_as_file': 'no', 'network_include_render_details': 'no', 'output_t
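
The cell above formats a brace-less JSON body and then adds the outer curly braces in a second step, because literal `{` and `}` would otherwise collide with `str.format()`. An alternative sketch is to build the spec as a Python dict and let `json.dumps()` handle quoting. The helper name `build_spec` and the example folder path are hypothetical, and only a handful of the keys from the full spec are shown.

```python
import json


def build_spec(output_folder_path, **overrides):
    """Return a (partial) network data spec as a dict; keys mirror the JSON above."""
    spec = {
        "start_date": "2005-01-01",
        "end_date": "2005-12-31",
        "publications": "1",
        "coders": "2",
        "output_type": "tab_delimited_matrix",
        "network_label": "all_grp_hard_news_2005",
        "save_data_in_folder": output_folder_path,
    }
    # keyword arguments replace or add individual properties.
    spec.update(overrides)
    return spec


demo_spec = build_spec("/tmp/network_data", network_label="demo_label")
demo_spec_json = json.dumps(demo_spec, sort_keys=True, indent=4)
print(demo_spec_json)
```

Because the dict never passes through `str.format()`, a folder path containing quotes or backslashes still serializes to valid JSON.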

### Setup - update base data spec for different time slices

To update this for different time slices:

- make a copy of `base_data_spec_json`:

    - not threadsafe:
    
            my_timeslice_spec = copy.deepcopy( base_data_spec_json )
    
    - threadsafe (but doesn't handle complex data types - ours is just JSON, though, so fine here):
    
            my_timeslice_spec = json.loads( json.dumps( base_data_spec_json ) )

- update the `start_date` and `end_date` to the period you want for your time slice.

        my_timeslice_spec[ NetworkOutput.PARAM_START_DATE ] = "2009-12-01"
        my_timeslice_spec[ NetworkOutput.PARAM_END_DATE ] = "2009-12-31"

- update the `network_label` value so that it captures what time slice you are making.

        my_timeslice_spec[ NetworkOutput.PARAM_NETWORK_LABEL ] = "month-grp-automated-20091201-20091231"

    - example pattern: `<type>-<paper>-<coder>-<start_date>-<end_date>`
    - examples:
        
            week-grp-automated-20050501-20050507
            7day-grp-automated-20050502-20050508

    - type would be either:

        - actual time period:

            - week
            - month
            - quarter
            - half-year
            - year

        - conceptual time period:

            - sliding week = "7day"
            - sliding month = "31day"
            - sliding quarter = "92day"
            - sliding half-year = "183day"
            - sliding year = "365day"

_NOTE: leave person query parameters the same for all networks if you want all your network matrices to have same set of people (same count and position of rows and columns) so each network can be compared to all others, regardless of time period of a given network slice._
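
The steps above (copy the spec, set the dates, build the label) can be bundled into one helper. This is a sketch only: the function name `make_timeslice_spec` is hypothetical, and it uses the literal JSON keys `start_date`, `end_date`, and `network_label` on the assumption that they match the `NetworkOutput.PARAM_*` constants.

```python
import copy
import datetime


def make_timeslice_spec(base_spec, slice_type, start_date, end_date,
                        paper="grp", coder="automated"):
    """Return a copy of base_spec updated for one time slice."""
    spec = copy.deepcopy(base_spec)
    spec["start_date"] = start_date.isoformat()
    spec["end_date"] = end_date.isoformat()
    # label pattern: <type>-<paper>-<coder>-<start_date>-<end_date>
    spec["network_label"] = "{}-{}-{}-{}-{}".format(
        slice_type,
        paper,
        coder,
        start_date.strftime("%Y%m%d"),
        end_date.strftime("%Y%m%d"),
    )
    return spec


demo_base_spec = {
    "start_date": "2005-01-01",
    "end_date": "2005-12-31",
    "person_start_date": "2005-01-01",
    "person_end_date": "2005-12-31",
    "network_label": "all_grp_hard_news_2005",
}
week_spec = make_timeslice_spec(
    demo_base_spec,
    "week",
    datetime.date(2005, 5, 1),
    datetime.date(2005, 5, 7),
)
print(week_spec["network_label"])
```

The person query dates are deliberately left alone, in line with the note above about keeping the same set of people across slices.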

In [13]:
'''
# make a copy of base data spec
before_data_spec_json = copy.deepcopy( base_data_spec_json )

# update properties
before_data_spec_json[ NetworkOutput.PARAM_START_DATE ] = "2009-01-08"
before_data_spec_json[ NetworkOutput.PARAM_END_DATE ] = "2010-01-08"
before_data_spec_json[ NetworkOutput.PARAM_PERSON_START_DATE ] = "2009-01-08"
before_data_spec_json[ NetworkOutput.PARAM_PERSON_END_DATE ] = "2011-01-08"
before_data_spec_json[ NetworkOutput.PARAM_NETWORK_LABEL ] = "grp_year_before_layoff"

print( "updated data spec:\n{}".format( json.dumps( before_data_spec_json, sort_keys = True, indent = 4 ) ) )
'''

'\n# make a copy of base data spec\nbefore_data_spec_json = copy.deepcopy( base_data_spec_json )\n\n# update properties\nbefore_data_spec_json[ NetworkOutput.PARAM_START_DATE ] = "2009-01-08"\nbefore_data_spec_json[ NetworkOutput.PARAM_END_DATE ] = "2010-01-08"\nbefore_data_spec_json[ NetworkOutput.PARAM_PERSON_START_DATE ] = "2009-01-08"\nbefore_data_spec_json[ NetworkOutput.PARAM_PERSON_END_DATE ] = "2011-01-08"\nbefore_data_spec_json[ NetworkOutput.PARAM_NETWORK_LABEL ] = "grp_year_before_layoff"\n\nprint( "updated data spec:\n{}".format( json.dumps( before_data_spec_json, sort_keys = True, indent = 4 ) ) )\n'

## Setup - shared datetime instances

To start, make:

- `datetime.date`s for:

    - start date (2005-01-01)
    - end date (2010-11-30)
    - layoff date (2010-01-08)

- `datetime.timedelta`s for 1 year (365 days), roughly 9, 6, and 3 months, 1 month (31 days), and 1 day.

In [12]:
# declare variables
first_article_date = None
last_article_date = None
layoff_date = None
one_year_delta = None
nine_month_delta = None
six_month_delta = None
three_month_delta = None
month_31_delta = None
one_day_delta = None

# make dates and timedeltas
first_article_date = datetime.date( 2005, 1, 1 )
print( "First article date: {}".format( first_article_date ) )

last_article_date = datetime.date( 2010, 11, 30 )
print( "Last article date: {}".format( last_article_date ) )

layoff_date = datetime.date( 2010, 1, 8 )
print( "Layoff date: {}".format( layoff_date ) )

one_year_delta = datetime.timedelta( days = 365 )
print( "One year delta: {}".format( one_year_delta ) )

nine_month_delta = datetime.timedelta( days = 274 )
print( "nine month-ish delta: {}".format( nine_month_delta ) )

six_month_delta = datetime.timedelta( days = 183 )
print( "six month-ish delta: {}".format( six_month_delta ) )

three_month_delta = datetime.timedelta( days = 92 )
print( "three month-ish delta: {}".format( three_month_delta ) )

month_31_delta = datetime.timedelta( days = 31 )
print( "month-ish delta: {}".format( month_31_delta ) )

one_day_delta = datetime.timedelta( days = 1 )
print( "One day delta: {}".format( one_day_delta ) )

First article date: 2005-01-01
Last article date: 2010-11-30
Layoff date: 2010-01-08
One year delta: 365 days, 0:00:00
six month-ish delta: 183 days, 0:00:00
three month-ish delta: 92 days, 0:00:00
month-ish delta: 31 days, 0:00:00
One day delta: 1 day, 0:00:00
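
One thing to keep in mind with fixed-length deltas: a 365-day "year" lands on the same calendar day only when no February 29 falls inside the span. The sketch below shows the one-day drift across leap year 2008, plus a hypothetical helper (`same_day_previous_year`) for cases where calendar alignment matters more than equal window lengths.

```python
import datetime

demo_year_delta = datetime.timedelta(days=365)

# no leap day between 2009-01-08 and 2010-01-08: same calendar day.
no_leap_result = datetime.date(2010, 1, 8) - demo_year_delta

# 2008 was a leap year, so 365 days back from 2009-01-08 is one day "late".
leap_result = datetime.date(2009, 1, 8) - demo_year_delta


def same_day_previous_year(value):
    """Return the same month and day one calendar year earlier."""
    try:
        return value.replace(year=value.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        return value.replace(year=value.year - 1, day=28)


print(no_leap_result)
print(leap_result)
print(same_day_previous_year(datetime.date(2009, 1, 8)))
```

Fixed 365-day windows keep every network the same length, which is probably what matters for comparing slices. The helper is only relevant if labels or reports need to line up with calendar dates.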


# network data output example - base data spec

In [None]:
# try creating network data.
start_dt = datetime.datetime.now()
print( "==> starting network creation at {}".format( start_dt ) )

network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = base_data_spec_json,
    debug_flag_IN = False
)

end_dt = datetime.datetime.now()
print( "==> network creation complete at {}".format( end_dt ) )

# duration:
my_duration = end_dt - start_dt
print( "----> duration: {}".format( my_duration ) )

- if include_persons_with_single_word_name = "yes": 2427606
- if include_persons_with_single_word_name = "no": 2344545

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "d3fd8b3a0daa0c9e4b05a7017b51b16bbae95be1e11b0cb1293c6554867bf201"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 379118986
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string len()gth of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 13755
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#
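
The three checks above (hash, length, person count) all follow the same compare-and-print pattern, which is repeated again for the later networks. A small helper could cut the repetition. This is a sketch with a hypothetical name (`check_expected`), not something the notebook currently defines.

```python
def check_expected(label, actual, expected):
    """Print MATCH or ERROR for one value and return True on a match."""
    is_match = (actual == expected)
    if is_match:
        print("MATCH - {} of {} matches expected.".format(label, actual))
    else:
        print("ERROR! {} is {}, should be {}".format(label, actual, expected))
    return is_match


# toy values, just to show both outcomes.
demo_match = check_expected("person count", 3, 3)
demo_mismatch = check_expected("network data length", 10, 12)
```

Returning the boolean makes it easy to collect the results and stop a long run early if any check fails.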

# network data output - years around 1/8/2010

In [None]:
layoff_date = datetime.date( 2010, 1, 8 )
print( "Layoff date: {}".format( layoff_date ) )

In [None]:
one_year_delta = datetime.timedelta( days = 365 )
print( "One year delta: {}".format( one_year_delta ) ) 

In [None]:
one_year_before_date = layoff_date - one_year_delta
print( "one_year_before_date = {}".format( one_year_before_date ) )

In [None]:
one_year_after_date = layoff_date + one_year_delta
print( "one_year_after_date = {}".format( one_year_after_date ) )

## years around 1/8/2010

In [14]:
# make a NetworkOutput instance to re-use
network_outputter = NetworkOutput()

In [27]:
# set up parameters
my_label = "grp_layoff"
my_output_folder_path = "{base_output_path}/{label}".format(
    base_output_path = network_data_output_folder_path,
    label = my_label
)

# call function to make data.
create_pre_post_networks(
    date_IN = layoff_date,
    data_spec_IN = base_data_spec_json,
    network_timedelta_IN = one_year_delta,
    label_prefix_IN = my_label,
    output_folder_path_IN = my_output_folder_path,
    do_create_IN = False,  # set to True to actually create the networks; the timings shown below are from a run with True.
    debug_flag_IN = True,
    network_outputter_IN = network_outputter
)

==> current time range: 2009-01-08 - 2010-01-07; 2010-01-08 - 2011-01-07
----> pre - starting network creation at 2022-06-04 01:54:24.385705
----> pre - network creation complete at 2022-06-04 01:56:55.499955
--------> pre - duration: 0:02:31.114250
----> post - starting network creation at 2022-06-04 01:56:55.500104
----> post - network creation complete at 2022-06-04 01:59:17.775149
--------> post - duration: 0:02:22.275045


<context_text.export.network_output.NetworkOutput at 0x7f847c6445e0>

## year before 1/8/2010

In [None]:
# make a copy of base data spec
before_data_spec_json = copy.deepcopy( base_data_spec_json )

# update properties
before_data_spec_json[ NetworkOutput.PARAM_START_DATE ] = "2009-01-08"
before_data_spec_json[ NetworkOutput.PARAM_END_DATE ] = "2010-01-07"
before_data_spec_json[ NetworkOutput.PARAM_PERSON_START_DATE ] = "2009-01-08"
before_data_spec_json[ NetworkOutput.PARAM_PERSON_END_DATE ] = "2011-01-07"
before_data_spec_json[ NetworkOutput.PARAM_NETWORK_LABEL ] = "grp_year_before_layoff"

print( "updated data spec:\n{}".format( json.dumps( before_data_spec_json, sort_keys = True, indent = 4 ) ) )

In [None]:
# try creating network data.
start_dt = datetime.datetime.now()
print( "----> starting network creation at {}".format( start_dt ) )

network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = before_data_spec_json,
    debug_flag_IN = False
)

end_dt = datetime.datetime.now()
print( "----> network creation complete at {}".format( end_dt ) )

# duration:
my_duration = end_dt - start_dt
print( "--------> duration: {}".format( my_duration ) )

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "bdb2945558d568ad8758d78c4b7be3e3f65ff3f569acc14c4c837c2b14170266"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 579121117
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string len()gth of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 17003
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

## year after 1/8/2010

In [None]:
# make a copy of base data spec
after_data_spec_json = copy.deepcopy( base_data_spec_json )

# update properties
after_data_spec_json[ NetworkOutput.PARAM_START_DATE ] = "2010-01-08"
after_data_spec_json[ NetworkOutput.PARAM_END_DATE ] = "2011-01-07"
after_data_spec_json[ NetworkOutput.PARAM_PERSON_START_DATE ] = "2009-01-08"
after_data_spec_json[ NetworkOutput.PARAM_PERSON_END_DATE ] = "2011-01-07"
after_data_spec_json[ NetworkOutput.PARAM_NETWORK_LABEL ] = "grp_year_after_layoff"

print( "updated data spec:\n{}".format( json.dumps( after_data_spec_json, sort_keys = True, indent = 4 ) ) )

In [None]:
# try creating network data.
start_dt = datetime.datetime.now()
print( "==> starting network creation at {}".format( start_dt ) )

network_outputter = NetworkOutput()
network_data = network_outputter.process_network_output_request(
    params_IN = after_data_spec_json,
    debug_flag_IN = False
)

end_dt = datetime.datetime.now()
print( "==> network creation complete at {}".format( end_dt ) )

# duration:
my_duration = end_dt - start_dt
print( "----> duration: {}".format( my_duration ) )

- if include_persons_with_single_word_name = "yes": 2427606
- if include_persons_with_single_word_name = "no": 2344545

In [None]:
# create a hash of the data, for comparison
network_data_hash = make_string_hash( network_data )
print( "Network data hash: {}".format( network_data_hash ) )

# match?
should_be = "03562c3e38bb0f0a2feda08da44912291f1fc443c6c7f75bdd43f182cc30ecfa"
if ( network_data_hash != should_be ):
    
    # not right hash. Error.
    print( "ERROR! network data hash is {}, should be {}".format( network_data_hash, should_be ) )
    
else:
    
    # a match
    print( "MATCH - network data hash {} matches expected. hooray!".format( network_data_hash ) )
    
#-- END debug/test --#

In [None]:
network_data_length = len( network_data )
should_be = 579123903
print( "Network data length: {}".format( network_data_length ) )
if ( network_data_length != should_be ):
    
    # not right length. Error.
    print( "ERROR! network data length is {}, should be {}".format( network_data_length, should_be ) )
    
else:
    
    # a match
    print( "MATCH - string length of {} matches expected. hooray!".format( network_data_length ) )
    
#-- END debug/test --#

In [None]:
# look at master person dict
master_person_dict = network_outputter.create_person_dict( load_person_IN = True )

# how many entries?
person_count = len( master_person_dict )
print( "- person count: {person_count}".format( person_count = person_count ) )

# right number?
should_be = 13755
if ( person_count != should_be ):
    
    # not right length. Error.
    print( "ERROR! person count is {}, should be {}".format( person_count, should_be ) )
    
else:
    
    # a match
    print( "MATCH - person count of {} matches expected. hooray!".format( person_count ) )
    
#-- END debug/test --#

# pairs of years within GRP data

Start with the first article date plus 365 days, then go forward one day at a time, making one-year snapshots for the year before and the year after each date, with the person query covering both years.
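
The sliding-window scheme described above can be sketched as a small generator. This is a minimal illustration, not the notebook's `create_pre_post_network_pairs` function: the name `iter_pre_post_ranges`, its parameters, and the stopping rule (stop once the post window would extend past the last date) are assumptions, and the sample dates are chosen only to mirror the ranges printed in the log output in this section.

```python
import datetime

def iter_pre_post_ranges( first_date, last_date, window_delta, step_delta ):

    """
    Hypothetical helper: yields ( pre_range, post_range ) tuples, where each
        range is an inclusive ( start_date, end_date ) pair.
    """

    one_day = datetime.timedelta( days = 1 )

    # first base date is one full window after the first date.
    base_date = first_date + window_delta

    # assumption: keep going while the post window ends on or before last_date.
    while ( base_date + window_delta - one_day ) <= last_date:

        pre_range = ( base_date - window_delta, base_date - one_day )
        post_range = ( base_date, base_date + window_delta - one_day )
        yield ( pre_range, post_range )

        # move base date forward by the sampling increment.
        base_date += step_delta

    #-- END loop over base dates --#

#-- END function iter_pre_post_ranges() --#

# illustrative dates only.
example_ranges = list( iter_pre_post_ranges(
    first_date = datetime.date( 2005, 1, 1 ),
    last_date = datetime.date( 2010, 12, 1 ),
    window_delta = datetime.timedelta( days = 365 ),
    step_delta = datetime.timedelta( days = 31 )
) )

# print the first two pairs of ranges.
for pre_range, post_range in example_ranges[ : 2 ]:
    print( "{} - {}; {} - {}".format( pre_range[ 0 ], pre_range[ 1 ], post_range[ 0 ], post_range[ 1 ] ) )
```

Passing a one-day `step_delta` gives the day-by-day version; a 31-day step gives the roughly monthly sampling used in the subsection below.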

In [32]:
# make a NetworkOutput instance to re-use
network_outputter = NetworkOutput()

## pairs of years, sample every 31 days.

In [37]:
# set up parameters
my_label = "grp_years"
my_output_folder_path = "{base_output_path}/{label}".format(
    base_output_path = network_data_output_folder_path,
    label = my_label
)

# call function to make data.
create_pre_post_network_pairs(
    start_date_IN = first_article_date,
    end_date_IN = last_article_date,
    data_spec_IN = base_data_spec_json,
    network_timedelta_IN = one_year_delta,
    increment_timedelta_IN = month_31_delta,
    label_prefix_IN = my_label,
    output_folder_path_IN = my_output_folder_path,
    do_create_IN = True,
    debug_flag_IN = True,
    network_outputter_IN = network_outputter
)

processing base dates from 2006-01-01 to 2009-12-01
==> current time range ( 1 ): 2005-01-01 - 2005-12-31; 2006-01-01 - 2006-12-31
----> pre - starting network creation at 2022-06-04 02:16:33.558518
----> pre - network creation complete at 2022-06-04 02:20:41.896876
--------> pre - duration: 0:04:08.338358
----> post - starting network creation at 2022-06-04 02:20:41.897033
----> post - network creation complete at 2022-06-04 02:24:43.512117
--------> post - duration: 0:04:01.615084
==> current time range ( 2 ): 2005-02-01 - 2006-01-31; 2006-02-01 - 2007-01-31
----> pre - starting network creation at 2022-06-04 02:24:43.542847
----> pre - network creation complete at 2022-06-04 02:28:45.967748
--------> pre - duration: 0:04:02.424901
----> post - starting network creation at 2022-06-04 02:28:45.967904
----> post - network creation complete at 2022-06-04 02:32:58.429820
--------> post - duration: 0:04:12.461916
==> current time range ( 3 ): 2005-03-04 - 2006-03-03; 2006-03-04 - 2007-03-

<context_text.export.network_output.NetworkOutput at 0x7f84763e13c0>

# pairs of 6-months within GRP data

Start with the first article date plus 183 days (365/2, rounded up), then go forward one day at a time, making half-year snapshots for the half-year before and the half-year after each date, with the person query covering both half-years.
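
As a quick check of the arithmetic, the half-year window length and the first pair of half-year ranges can be computed directly. This is a sketch under assumptions: the variable names are illustrative (the notebook's own `six_month_delta` is defined elsewhere), and the first article date of 2005-01-01 is inferred from the log output in this section rather than stated outright.

```python
import datetime
import math

# half of a 365-day year, rounded up.
half_year_days = math.ceil( 365 / 2 )
half_year_delta = datetime.timedelta( days = half_year_days )
one_day = datetime.timedelta( days = 1 )

# assumed first article date, for illustration only.
first_date = datetime.date( 2005, 1, 1 )

# first base date, and the inclusive half-year ranges on either side of it.
base_date = first_date + half_year_delta
pre_range = ( base_date - half_year_delta, base_date - one_day )
post_range = ( base_date, base_date + half_year_delta - one_day )

print( "half-year window: {} days".format( half_year_days ) )
print( "pre: {} - {}".format( pre_range[ 0 ], pre_range[ 1 ] ) )
print( "post: {} - {}".format( post_range[ 0 ], post_range[ 1 ] ) )
```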

In [None]:
# make a NetworkOutput instance to re-use
network_outputter = NetworkOutput()

## pairs of 6-months, sample every 31 days

In [None]:
# set up parameters
my_label = "grp_6mos_by_month"
my_output_folder_path = "{base_output_path}/{label}".format(
    base_output_path = network_data_output_folder_path,
    label = my_label
)

# call function to make data.
create_pre_post_network_pairs(
    start_date_IN = first_article_date,
    end_date_IN = last_article_date,
    data_spec_IN = base_data_spec_json,
    network_timedelta_IN = six_month_delta,
    increment_timedelta_IN = month_31_delta,
    label_prefix_IN = my_label,
    output_folder_path_IN = my_output_folder_path,
    do_create_IN = True,
    debug_flag_IN = True,
    network_outputter_IN = network_outputter
)

processing base dates from 2005-07-03 to 2010-06-01
==> current time range ( 1 ): 2005-01-01 - 2005-07-02; 2005-07-03 - 2006-01-01
----> pre - starting network creation at 2022-06-04 11:57:39.648172
----> pre - network creation complete at 2022-06-04 11:59:28.307220
--------> pre - duration: 0:01:48.659048
----> post - starting network creation at 2022-06-04 11:59:28.307371
----> post - network creation complete at 2022-06-04 12:01:17.869698
--------> post - duration: 0:01:49.562327
==> current time range ( 2 ): 2005-02-01 - 2005-08-02; 2005-08-03 - 2006-02-01
----> pre - starting network creation at 2022-06-04 12:01:17.884553
----> pre - network creation complete at 2022-06-04 12:03:05.601022
--------> pre - duration: 0:01:47.716469
----> post - starting network creation at 2022-06-04 12:03:05.601191
----> post - network creation complete at 2022-06-04 12:04:52.547317
--------> post - duration: 0:01:46.946126
==> current time range ( 3 ): 2005-03-04 - 2005-09-02; 2005-09-03 - 2006-03-

# write network data to file

In [None]:
# write the output to a file
current_date_time = None
my_file_extension = None
network_data_file_path = None
network_data_file = None

# time stamp and file extension to append to file name
current_date_time = datetime.datetime.now().strftime( '%Y%m%d-%H%M%S' )
my_file_extension = "txt"

# make file path.
network_data_file_path = "context_text_data-{timestamp}.{file_extension}".format(
    timestamp = current_date_time,
    file_extension = my_file_extension
)

# write to file.
with open( network_data_file_path, 'w' ) as network_data_file:

    # output all the data to file.
    network_data_file.write( network_data )
    
#-- END with open( network_data_file_path, 'w' ) as network_data_file --#

print( "network data written to file {} at {}".format( network_data_file_path, datetime.datetime.now() ) )

# Explore data

In [None]:
# set data file path.
data_file_name = "all_grp_hard_news_2005-20220602-012223"
data_file_path = "{output_folder_path}/{data_file_name}".format(
    output_folder_path = network_data_output_folder_path,
    data_file_name = data_file_name
)
update_every_x = 1000

print( data_file_path )

In [None]:
# declare variables
data_file = None
data_file_reader = None
data_file_line = None
data_file_line_item_list = None
person_info = None
person_info_count = None
person_info_counter = None
person_info_lower = None
counter_unknown = None
counter_author = None
counter_source = None
counter_both = None
update_every_x = 1000

# Open network data output file for reading.
with open( data_file_path, "r" ) as data_file:
    
    # csv.reader
    #data_file_reader = csv.reader( data_file, delimiter=':', quoting=csv.QUOTE_NONE )
    
    # read first line.
    data_file_line = data_file.readline()

    # split on tabs.
    data_file_line_item_list = data_file_line.split( "\t" )
    
#-- END with open( data_file_path, "r" ) as data_file: --#

person_info_count = len( data_file_line_item_list )
person_info_counter = 0

# loop and add up different person types.
counter_unknown = 0
counter_author = 0
counter_source = 0
counter_both = 0
for person_info in data_file_line_item_list:
    
    # increment counter
    person_info_counter += 1
    
    # does string contain...
    person_info_lower = person_info.lower()
    
    # ==> "unknown"
    if ( "unknown" in person_info_lower ):
        
        counter_unknown += 1
        
    #-- END check if unknown --#

    # ==> "author"
    if ( "author" in person_info_lower ):

        counter_author += 1

    #-- END check if author --#

    # ==> "source"
    if ( "source" in person_info_lower ):

        counter_source += 1

    #-- END check if source --#
        
    # ==> "both"
    if ( "both" in person_info_lower ):
    
        counter_both += 1
    
    #-- END check if both --#
        
    # time to give brief update?
    if ( ( person_info_counter % update_every_x ) == 0 ):
        
        # yes.
        status_message = "----> finished processing {counter} of {total} @ {my_timestamp}".format(
            counter = person_info_counter,
            total = person_info_count,
            my_timestamp = datetime.datetime.now()
        )
        print( status_message )
        
    #-- END check if update time. --#

#-- END loop over header line of data file. --#

print( "\n" )
print( "Finished processing {record_count} header column names:".format( record_count = person_info_counter ) )
print( "- counter_unknown = {}".format( counter_unknown ) )
print( "- counter_author = {}".format( counter_author ) )
print( "- counter_source = {}".format( counter_source ) )
print( "- counter_both = {}".format( counter_both ) )

In [None]:
# declare variables
data_file = None
data_file_reader = None
data_file_line = None
data_file_line_item_list = None
data_file_item = None
data_file_item_value = None
counter_tie = None
sum_weight = None
counter_zero = None
counter_negative = None
counter_other = None
counter_empty = None
update_every_x = None
row_counter = None
column_counter = None

# initialize counters.
counter_tie = 0
sum_weight = 0
counter_zero = 0
counter_negative = 0
counter_other = 0
counter_empty = 0
update_every_x = 1000

# Open network data output file for reading.
with open( data_file_path, "r" ) as data_file:
    
    # csv.reader
    #data_file_reader = csv.reader( data_file, delimiter=':', quoting=csv.QUOTE_NONE )
    
    # read past the header (first) line.
    data_file_line = data_file.readline()

    # loop over lines in file
    row_counter = 0
    for data_file_line in data_file:
    
        row_counter += 1
    
        # split on tabs.
        data_file_line_item_list = data_file_line.split( "\t" )
        
        # then, loop over items in list. For each, if it is an integer greater
        #     than 0, add 1 to counter of ties and add number to weight-aggregator.
        column_counter = 0
        for data_file_item in data_file_line_item_list:
            
            column_counter += 1
            
            # is it non-empty?
            if (
                ( data_file_item is not None )
                and ( data_file_item != "" )
            ):
                
                # try to cast to int.
                try:
                    
                    # cast to int
                    data_file_item_value = int( data_file_item )
                    
                    # an int! > 0?
                    if ( data_file_item_value > 0 ):
                   
                        # yes! an actual tie
                        counter_tie += 1
                        sum_weight += data_file_item_value
                        
                    elif ( data_file_item_value == 0 ): 
                        
                        # no tie - increment counter_zero
                        counter_zero += 1
                        
                    else:
                        
                        # neither 0 nor greater than 0, so negative.
                        counter_negative += 1
                        
                    #-- END check what is in int... --#
                    
                except ValueError:
            
                    # not an integer (e.g., a string label).
                    counter_other += 1

                #-- END try...except --#
                
            else:
                
                # empty...?
                counter_empty += 1
                
            #-- END check if None or "" --#
            
        #-- END loop over items in line --#

        # time to give brief update?
        if ( ( row_counter % update_every_x ) == 0 ):

            # yes.
            status_message = "----> finished processing {row_counter} rows @ {my_timestamp}".format(
                row_counter = row_counter,
                my_timestamp = datetime.datetime.now()
            )
            print( status_message )

        #-- END check if update time. --#
          
    #-- END loop over lines in file. --#
    
#-- END with open( data_file_path, "r" ) as data_file: --#

print( "\n" )
print( "Finished processing {row_counter} rows:".format( row_counter = row_counter ) )
print( "- counter_tie = {}".format( counter_tie ) )
print( "-----> sum_weight = {}".format( sum_weight ) )
print( "- counter_zero = {}".format( counter_zero ) )
print( "- counter_negative = {}".format( counter_negative ) )
print( "- counter_other = {}".format( counter_other ) )
print( "- counter_empty = {}".format( counter_empty ) )

Finished processing 50823 rows:
- counter_tie = 137547
- -----> sum_weight = 2077935942
- counter_zero = 2582941428
- counter_negative = 0
- counter_other = 50823