Using machine learning to automate literature triage tasks.
Backpopulate/
- code for getting old articles that curators have not selected in the past.
These are negative, non-MGI-relevant articles.
Lessons Learned and Where We Are
Understanding Vectorizers
vect = CountVectorizer( stop_words="english", ngram_range=(1,2))
 * processing order: text is tokenized 1st (punctuation broken off/removed),
   then stopwords are removed, then ngrams are built, then min_df/max_df applied
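A minimal sketch of that ordering (assuming scikit-learn; the corpus here is
made up). With min_df applied last, only unigrams/bigrams that survive
tokenization and stopword removal in at least 2 docs remain:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "Figure 1. Mutant mice lack the gene.",
        "Figure 2. Wild type mice express the gene.",
    ]
    # tokenize -> drop stopwords -> build 1-2grams -> then apply min_df
    vect = CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=2)
    X = vect.fit_transform(docs)
    # only terms appearing in both docs survive min_df:
    # ['figure', 'gene', 'mice']
    print(vect.get_feature_names_out())   # newer scikit-learn API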
Dec 5 2018: vectorizer things:
in extracted text:
** *** and **** can appear as tokens aa aaa and aaaa
+/- can appear as "ae"
might want to remove doi:... in addition to URLs
"knock in" and "knock down" are fairly common ("knock" in 5.3% docs)
Previous thinking for python scripts in MLtextTools:
* getTrainingData.py would get a delimited file of samples
* preprocessSamples.py would massage the data in the delimited file(s)
* populateTrainingDirs.py would take a massaged delimited file and populate
the dirs that sklearn's load_files() wants.
Would it be better to have getTrainingData populate text files in the dirs
and massage the data in place (or into additional cooked files)?
BUT MLtextTools already has stuff in place, so let's keep with that
paradigm. Can always run populateTrainingDirs.py at any time to generate
the text files to look at.
- should add parameter to specify file extension (done)
- should make it not assume class names are 'yes' 'no' (done)
NER thoughts
Becas is very unreliable, down much of the time.
Haven't found any other online tool (need to look some more).
Pubtator does NER on pubmed abstracts and titles, but it won't do anything
with figure text.
I'm pondering doing some simple, easy to code, dictionary lookup approach
myself.
It doesn't have to be perfect, or even very good maybe.
But where to get the dictionary? Some thoughts:
Build by hand from MGI gene symbols and ontology terms?
Pull mapping files for all of pubmed from Pubtator and distill
out the entity mappings - easy to program, but a honking pile
of data.
For our particular set of pubmed IDs, pull the mappings out of
Pubtator and apply those mappings to the figure text?
(would be a much smaller set of mappings and likely more
relevant/specific to our references??)
I'm kind of liking this idea.
(unfortunately, Pubtator doesn't do anatomy)
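A toy sketch of the dictionary-lookup idea (the dictionary entries and the
replacement token are made-up examples, not real MGI data):

    import re

    entity_dict = {                     # hypothetical dictionary entries
        "pax6": "mgi_gene",
        "sonic hedgehog": "mgi_gene",
    }
    # longest keys first so multi-word entities match before substrings
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k)
                 for k in sorted(entity_dict, key=len, reverse=True)) + r")\b",
        re.IGNORECASE)

    def tagEntities(text):
        return pattern.sub(lambda m: entity_dict[m.group(0).lower()], text)

    print(tagEntities("Sonic hedgehog regulates Pax6 in the eye."))
    # -> "mgi_gene regulates mgi_gene in the eye."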
I guess the first thing is to get an initial training pipeline built
without NER.
...
Dear Diary:
Dec 5, 2018
Grabbed dataset to use going forward (for now)
Idea: should write a script to calculate recall for each curation group.
Note, cannot really do precision since all the papers for each
curation group are already "Keep" (so there are no false positives)
Dec 6, 2018
Cannot do ngrams (1,3) - blows out the memory on my Mac
Tried not removing stopwords and using feature counts instead of binary,
these do not seem to be helpful at this point...
Analyzed F-scores in https://docs.google.com/spreadsheets/d/1UpMNN4Qj1Ty9pYiQ4ZBspq2fef2qTeybHZMDc35Que8/edit#gid=1787124464
Decided best for reporting "success" is F2 with values > 0.85.
Still using F4 during gridsearch to weigh results toward recall, but
this is questionable.
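For reference, a small sketch of the F-beta scoring in question (sklearn's
fbeta_score; the labels here are made up). Beta > 1 weights recall over
precision, so F4 pushes the grid search harder toward recall than F2 does:

    from sklearn.metrics import fbeta_score, make_scorer

    y_true = [1, 1, 1, 0, 0]      # hypothetical keep(1)/discard(0) labels
    y_pred = [1, 1, 0, 0, 1]
    f2 = fbeta_score(y_true, y_pred, beta=2)       # for reporting
    f4_scorer = make_scorer(fbeta_score, beta=4)   # for GridSearchCV scoring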
Dec 7, 2018
using Dec5/Proc1
Seemingly good, out of the box pipeline: 2018/12/07-09-21-48 SGDlog.py
alpha = 1
min_df = .02
ngrams = (1,2)
stopwords = english
binary
Gets (test) Precision: .61, Recall: .91, F2: .83
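A sketch of what a pipeline with those parameters might look like (this is
an assumption about SGDlog.py's structure, not its actual code; note newer
scikit-learn spells the logistic loss "log_loss", older versions "log"):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import SGDClassifier

    pipeline = Pipeline([
        ("vectorizer", CountVectorizer(binary=True,
                                       min_df=0.02,
                                       ngram_range=(1, 2),
                                       stop_words="english")),
        # loss="log_loss" = logistic regression trained via SGD
        ("classifier", SGDClassifier(loss="log_loss", alpha=1.0)),
    ])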
Looking at false negative 28228254
some Fig legends are split across page boundaries (so only get 1st part).
Supplemental figures won't be found because they start with "Supplemental
Figure" instead of just "Figure".
"mice" only occurs in figure legend text skipped because of the above
reasons & not in title/abstract
Maybe we should omit "mice" as a feature since these are all supposed to
be mice papers anyway ??
Need to look at more FP and FN examples.
false neg 27062441
indexed by pm2gene, not selected for anything else. Monica says she
would not select it as relevant since it is about a human protein.
So this is really a true negative.
Maybe we want to remove these from the whole raw set?
Text transformations
Looking at Jackie's 1-cell, one-cell, ... these seem relevant.
Should add these.
Changes
* figureText.py - to include figure paragraphs that begin "supplemental
fig"
* omit doi Ids along with URLs
Dec 8, 2018
using Dec5/Proc2
added more feature transformations
1-cell ...
knock in (in addition to knock-out)
gene trap
gets us to precision 60, recall 92
Need to look at Debbie's "oma" transformations.
In the current set of features, there is only one feature: carcinoma
- but maybe if we collapse the other omas, this feature would be boosted.
But have to factor out glaucoma. Are there others?
Need to investigate.
Dec 10, 2018
Added all the MTB "oma"s to feature transformation --> "tumor_type"
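Roughly, the transformation collapses "-oma" words into a single tumor_type
token, minus exceptions like glaucoma. A hedged sketch (the exception list
and regex here are illustrative, not the real MTB-derived list):

    import re

    OMA_EXCEPTIONS = {"glaucoma", "stroma", "soma"}   # hypothetical

    def mapTumorTypes(text):
        def sub(m):
            word = m.group(0)
            return word if word.lower() in OMA_EXCEPTIONS else "tumor_type"
        return re.sub(r"\b\w+omas?\b", sub, text, flags=re.IGNORECASE)

    print(mapTumorTypes("carcinoma and melanomas, but not glaucoma"))
    # -> "tumor_type and tumor_type, but not glaucoma"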
On classifier prior to these additional feature transforms, investigating
some FN
27062441 - predicted discard, IS actually discard - discarded after I
grabbed the data set
26676765 - only pm2gene - probably not relevant - check
26094765 - only pm2gene
12115612 - text extracted ok. Need curator to review
looking at some FP
28538185 - Cell Reports - fig 3 legend text lost due to funky splitting
across pages, but other figures intact. Need curator to review
28052251 - Cell Reports - fig 2 partially lost due to split
28614716 - Cell Reports - fig 1,3 partially lost due to split
28575432 - Endocrinology - just a "news and reviews" very short
28935574 - lots of fig text missing: 1,2,4,5,7,12,13 - seems to have trouble
finding the beginning of fig text (only finding "Figure x")
On classifier with new tumor_type transformation: no improvement of
precision and recall: 61 and 92
interestingly "tumor_typ" is a highly negative term (???)
Dec 11, 2018 SGDlog
I want to see how well the relevance automation works for papers
selected by the different groups.
Wrote tdataDocStatus.py to pull article curation statuses out of the db.
Wrote tdataGroupRecall.py to use those statuses + a prediction file to compute
the recall for papers selected for each curation group.
based on current predictions for the current test set:
Recall for papers selected by each curation group
ap_status selected papers: 1658 predicted keep: 1584 recall: 0.955
gxd_status selected papers: 167 predicted keep: 166 recall: 0.994
go_status selected papers: 1910 predicted keep: 1796 recall: 0.940
tumor_status selected papers: 178 predicted keep: 132 recall: 0.742
qtl_status selected papers: 7 predicted keep: 5 recall: 0.714
Totals selected papers: 2268 predicted keep: 2082 recall: 0.918
The smaller number of papers for tumor and GXD match the smaller number
of papers actually chosen/indexed/full-coded in the database since
Oct 2017. Roughly 10% of A&P and GO.
Makes me think we need to look at the distributions in the test/validation
sets from two axes (at least):
by journal - and really should be by keep/discard by journal
by curation group selection
and make sure they match the distributions of all the data since Oct 2017.
(it looks like they do for curation group selection)
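The per-group recall computation itself is simple; here is a minimal sketch
of what tdataGroupRecall.py presumably does (the data structures and status
values are assumptions):

    def groupRecall(statuses, predictions, group):
        """statuses:    {refID: {groupName: 'chosen'/'rejected'/...}}
           predictions: {refID: 'keep'/'discard'}"""
        selected = [r for r, s in statuses.items() if s.get(group) == "chosen"]
        kept = [r for r in selected if predictions.get(r) == "keep"]
        recall = len(kept) / len(selected) if selected else 0.0
        return len(selected), len(kept), recall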
Dec 12, 2018 SGDlog
Looking at papers indexed for GO by pm2geneload that have not been deemed
relevant by any curator...
select distinct a.accid pubmed
from bib_refs r join bib_workflow_status bs on
(r._refs_key = bs._refs_key and bs.iscurrent=1 )
join bib_status_view bsv on (r._refs_key = bsv._refs_key)
join acc_accession a on
(a._object_key = r._refs_key and a._logicaldb_key=29 -- pubmed
and a._mgitype_key=1 )
where
(bs._status_key = 31576673 and bs._group_key = 31576666 and
bs._createdby_key = 1571) -- index for GO by pm2geneload
and bsv.ap_status in ('Not Routed', 'Rejected')
and bsv.gxd_status in ('Not Routed', 'Rejected')
and bsv.tumor_status in ('Not Routed', 'Rejected')
Finds 1758 papers. Seems like these should be removed from the sample set
because we don't know for sure that these are really MGI relevant.
(1758 is about 3.5% of our sample papers)
Dec 19, 2018 SGDlog
Changed tdataGetRaw.py to skip the pm2gene references as above.
Used that to grab updated data in Data/dec19.
Retraining/evaluating get tiny improvement:
2018/12/19-12-46-56 PRF2,F1 0.62 0.92 0.84 0.74 SGDlog.py
Recall for papers selected by each curation group (FOR TEST SET PREDS)
ap_status selected papers: 1664 predicted keep: 1592 recall: 0.957
gxd_status selected papers: 163 predicted keep: 160 recall: 0.982
go_status selected papers: 1789 predicted keep: 1696 recall: 0.948
tumor_status selected papers: 187 predicted keep: 142 recall: 0.759
qtl_status selected papers: 4 predicted keep: 4 recall: 1.000
Totals selected papers: 2153 predicted keep: 1989 recall: 0.924
Added initial set of cell line transformations - those that are cell line
prefixes. No improvement.
Realized I'd like to see what these group recall numbers look like for
the training and validation sets. To get these, need a longer time frame
of paper statuses. Need to update tdataGetStatus.py for an earlier date range
Dec 20, 2018 SGDlog
changed tdataGetStatus.py as above. Interesting, a little worse:
Recall for papers selected by each curation group (FOR TRAINING SET PREDS)
ap_status selected papers: 17877 predicted keep: 16976 recall: 0.950
gxd_status selected papers: 2019 predicted keep: 1993 recall: 0.987
go_status selected papers: 16491 predicted keep: 15365 recall: 0.932
tumor_status selected papers: 1735 predicted keep: 1286 recall: 0.741
qtl_status selected papers: 137 predicted keep: 74 recall: 0.540
Totals selected papers: 21544 predicted keep: 19602 recall: 0.910
No idea what that means!
wrote wrapper scripts for extracting fig text and preprocessing the train,
test, & val data files
added cell line names (not prefixes) to transformations. No change really.
Looking at tumor papers/stats. Why is tumor recall so low? The counts of
papers and papers by status are very close to gxd papers.
So it doesn't seem that the training/test sets would somehow be skewed to
not include enough tumor papers (???).
However it does seem that tumor papers are harder to recognize (hence more
false negatives).
Dec 21, 2018 SGDlog
Updated tdataGroupRecall.py to optionally output rows that combine a
paper's prediction with its curation statuses.
So you can look at tumor FN or AP FN, etc.
Not sure how much it helps.
BUT the vast majority of tumor FN (41 of 45) are not selected for any
other group, and the vast majority of TP are selected by some other group.
I guess this just clarifies that tumor papers are harder to
detect. If they are relevant to any other group, they are easier to detect.
This seems less true for GXD and AP FN (just by eyeball).
Talked with Debbie about Tumor recall. She says tumor is the only group
that still "chooses" review papers. So there are possibly tumor selected
review papers that are not being recognized by the classifier.
I will work on removing review papers from the sample set and see what
happens.
Dec 27, 2018 SGDlog
Lots of file renaming for sanity's sake.
Jan 2, 2019 SGDlog
Refactored sdGetRaw.py to break out the SQL parts to simplify the queries and
to support getting sample record counts.
Counts of samples OMITTING review articles
Wed Jan 2 14:37:28 2019
1407 Omitted references (only pm2gene indexed)
27439 Discard since 11/1/2017
18709 Keep since 11/1/2017
7793 Keep 10/01/2016 through 10/31/2017
Counts of samples INCLUDING review articles
Wed Jan 2 14:38:21 2019
1407 Omitted references (only pm2gene indexed)
29235 Discard since 11/1/2017
18914 Keep since 11/1/2017
7839 Keep 10/01/2016 through 10/31/2017
So there are ~2000 review articles that are discards,
and ~200 that are keep
Omitting these helps overall P/R and tumor R some.
Without review articles: (compare to Dec 20 above)
Recall for papers selected by each curation group
ap_status selected papers: 1710 predicted keep: 1634 recall: 0.956
gxd_status selected papers: 183 predicted keep: 182 recall: 0.995
go_status selected papers: 1755 predicted keep: 1661 recall: 0.946
tumor_status selected papers: 147 predicted keep: 115 recall: 0.782
qtl_status selected papers: 3 predicted keep: 1 recall: 0.333
Totals selected papers: 2127 predicted keep: 1970 recall: 0.926
Here is overall P/R for Dec 20 (w/ review papers) vs. Jan 2 (without)
2018/12/20-13-46-18 PRF2,F1 0.63 0.92 0.84 0.75 SGDlog.py
2019/01/02-16-08-32 PRF2,F1 0.65 0.93 0.85 0.76 SGDlog.py
Jan 3, 2019 SGDlog
Realized that I CAN compute a precision value for individual curation
groups. I can get papers that are "negative" for a given group by counting
"discard" papers and papers that are rejected by the group. These should
be ground truth negatives for the group.
I'll convert groupRecall.py to output both, and output sets of FN and FP
for each group so we can easily look at examples for each group that
are wrongly predicted. (will rename the script too!)
Hmmm. need to think about this more, I thought it made sense, but now I'm
not so sure. (SEE JAN 4)
Jan 4, 2019 SGDlog
Group Recall question:
How many relevant papers will my group miss? (neg stmt)
Really: what fraction of papers predicted "discard" should be selected
by my group? (neg stmt)
Really: what fraction of group selected papers will be predicted "keep"
(i.e., will we look at) (pos stmt)
To answer:
GRecall = group selected predicted keep / group selected
In terms of TP and TN:
GTP = predicted_keep(restricted to group selected)
# (group selected => ground truth pos)
GFN = predicted_discard(restricted to group selected)
GTP + GFN = group selected
GRecall = GTP/(GTP + GFN) = GTP/(group selected)
Group Precision question:
How many papers will my group look at that we don't want? (neg stmt)
Really: what fraction of papers predicted "keep" will be
irrelevant to my group? (neg stmt)
(reject or will be skipped by 2nd triage report)
Really: what fraction of papers my group looks at will be
relevant to my group? (pos stmt)
(selected by my group)
BUT "what papers my group looks at" also dependes on the group's
2ndary triage selection report. So if we cannot take that
into account, I'm not sure how useful this is.
To answer:
GPrecision = group selected predicted keep / predicted keep
In terms of TP and TN:
GTP as above + predicted_keep(restricted to group selected)
GFP = predicted_keep(restricted to not group selected)
kind of murky,
does "not group selected" = rejected
or (rejected or true discard)?
So what does GTP + GFP include?
GPrecision = GTP/(GTP + GFP)
GOING TO PUNT ON THIS FOR NOW.
Looking at Tumor FN's. Send a small batch to Debbie.
Looking at a few (general) FP. Send a few to Debbie to look at.
Jan 7-8, 2019 SGDlog
Debbie and Sue looked at a few tumor FN and general FP examples from the
most recent SGDlog runs.
Debbie: of the 5 tumor FN examples:
29414305 (“mouse” “mice” do not occur in any of the used text fields)
27760307 (has “mouse”)
28430874 (no “mouse” “mice”)
28647610
28740119
all 5 are actually review papers, not marked as review in MGI - 3 ARE
marked as review in pubmed, 2 not.
So there are some papers in MGI that have not been marked as review
when they are in pubmed. These should be found and corrected.
So for these examples, if we correctly filter out review papers, they
would not be in the sample set and not be FN
(additional FN that are reviews: 28412456, 20977690)
Debbie/Sue looked at 4 FP
28414311
25362208
27317833
28887218
the 1st 3 should not have been discarded (they are actually TP)
the last is a zebrafish paper, so a correct FP
Sent some more examples, and ones that are not review papers.
ALSO started looking at randomforest classifier. Promising results, but
overfitting badly. Looking at ways to stop that.
Jan 9, 2019
Play w/ n_estimators, max_features, min_samples_split.
Changed textTuningLib .fit() method to predict on the val set BEFORE
retraining the bestEstimator on training + val sets.
This SHOULD make the results on the val set similar to the test set.
But for RF, the val set is getting similar results as the training set.
Very weird (this was also happening when I was predicting the val set
after retraining on training + val).
I don't understand it.
- looked at SGDlog.log, it has the expected behavior: val results like
test, even when predicting val on the retrained model
- looked at the way val and test sets were generated to see if they
were somehow different, but their journal distributions are very
close
So, there is something weird about RF and the val set.
Jan 10, 2019
Learned,
(1) with the gridsearch "refit" parameter, the default is to retrain on
whatever full dataset was passed to fit() (train + val, or just train),
without the removed cv folds.
So we don't need to retrain.
(2) if, when we predict the val set on the trained estimator, it scores like
the training set, probably is evidence of overfitting - adding the val
set to train on makes it learn the val set same as rest of the train
set. If the val set scores like test set (or intermediate), then things
seem good.
Restructured textTuningLib gridsearch to not redundantly retrain.
Finally found params the stop overfitting:
'classifier__n_estimators': [50],
'classifier__max_depth': [15],
'classifier__min_samples_split': [75,],
Need to try to improve P & R now (R actually)
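For the record, a sketch of that grid in scikit-learn terms (the "classifier"
step name is inferred from the parameter prefixes above; the pipeline itself
is an assumption, not the actual tuning code):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import fbeta_score, make_scorer

    rfPipeline = Pipeline([
        ("vectorizer", CountVectorizer(stop_words="english")),
        ("classifier", RandomForestClassifier()),
    ])
    param_grid = {
        "classifier__n_estimators": [50],
        "classifier__max_depth": [15],
        "classifier__min_samples_split": [75],
    }
    # refit=True (the default) already retrains the best estimator on the
    # full training data, so no separate retraining pass is needed
    gs = GridSearchCV(rfPipeline, param_grid,
                      scoring=make_scorer(fbeta_score, beta=2), refit=True)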
Jan 14, 2019
trying various params for RandomForestClassifier.
See https://docs.google.com/spreadsheets/d/1UpMNN4Qj1Ty9pYiQ4ZBspq2fef2qTeybHZMDc35Que8/edit#gid=1765328280
Seems playing with Min_samples_leaf is easiest way to control overfitting.
Seems unnecessary to go beyond 50 estimators.
Using: 25 (or 5) min_samples_leaf and 50 estimators gives
P: 83 R: 87 with not too bad overfitting.
Group recall:
Recall for papers selected by each curation group
ap_status selected papers: 1710 predicted keep: 1553 recall: 0.908
gxd_status selected papers: 183 predicted keep: 177 recall: 0.967
go_status selected papers: 1755 predicted keep: 1576 recall: 0.898
tumor_status selected papers: 147 predicted keep: 115 recall: 0.782
qtl_status selected papers: 3 predicted keep: 1 recall: 0.333
Totals selected papers: 2127 predicted keep: 1849 recall: 0.869
I don't see how to get much better w/ random forest - but this is pretty
good!
Jan 15, 2019
looking into MGI vs. pubmed review status.
Started checkRevPaper.py, ran into problems with NCBIutilsLib.py regarding
getting xml vs. json vs. medline outputs from pubmed.
Figured out rettype vs. retmode parameters to eutils summary and fetch
cmds. Subtle confusion.
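The distinction, as I understand it: retmode picks the serialization
(xml/json/text) while rettype picks the report flavor (medline, abstract,
docsum, ...). E.g., to fetch a MEDLINE-format record as plain text (PMID
taken from these notes; the actual request line is left commented out):

    import urllib.parse, urllib.request

    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = urllib.parse.urlencode({
        "db": "pubmed",
        "id": "28701356",
        "rettype": "medline",    # report flavor
        "retmode": "text",       # serialization
    })
    # print(urllib.request.urlopen(base + "?" + params).read()[:500])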
Jan 16, 2019
checkRevPaper.py working for comparing "Review" setting in MGI vs. pubmed.
Found flakiness @ pubmed. Sometimes data for a pubmed ID will work one time,
later, the same ID gets an error, later it works again...
initial finding:
Out of 53938 papers, review in pubmed but not MGI: 2109
(these are of the sample set which excluded MGI "review" papers)
Some IDs got kicked out by pubmed.
Starting to work on detecting review papers by finding "review" in text
near the beginning of doc.
Jan 18, 2019
have initial version of text searching for "review". Finding the end of
the abstract is a bit challenging.
Taking the last 5 words of the abstract,
searching for those words up to len(title) + len(abstract) + N words into
the extracted text - I didn't want to convert the whole extracted text into a
list of words, but maybe I should not worry about that. Or at least make
N very big.
Cannot find the end of the abstract for some papers for various reasons:
1) words/terms are not extracted exactly right, e.g., foo-blah may
be extracted as fooblah
2) N is not big enough - this seems the biggest culprit; setting N=2000
removes lots of examples.
3) multiple columns or other weird text flows may not put the abstract
text first. Other paragraphs may come 1st
BUT still out of 1000 papers, we don't find end of abstract about 120
times
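A rough sketch of the heuristic as described (N, the whitespace tokenization,
and the function shape are assumptions, not the actual code):

    def looksLikeReview(title, abstract, extractedText, N=2000):
        absTail = [w.lower() for w in abstract.split()[-5:]]
        limit = len(title.split()) + len(abstract.split()) + N
        window = [w.lower() for w in extractedText.split()[:limit]]
        end = None                       # word index of end of abstract
        for i in range(len(window) - len(absTail) + 1):
            if window[i:i + len(absTail)] == absTail:
                end = i + len(absTail)
                break
        if end is None:                  # couldn't find end of abstract
            return False
        return "review" in window[:end]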
Ok, code cleaned up, looking at 2000 papers (not marked as review in MGI)
Papers examined: 2000
Marked review in pubmed but not MGI: 106
Appears review via text but not MGI: 93
Appears review via text AND in MGI: 0
Appears review via text AND in pubmed: 59
Jan 25, 2019
Remembering where I left off. Cleaned up checkRevPaper.py.
Added basicLib.py to autolittriage.
Ran checkRevPaper.py on all the raw sample files (discard_after, keep_*).
Papers examined: 53943
Marked review in pubmed but not MGI: 2128
Appears review via text but not MGI: 3279
Appears review via text AND in MGI: 0
Appears review via text AND in pubmed: 1394
So approx 10% of potential review papers are NOT marked as review in MGI
(if we believe the text heuristic)
Put into google spreadsheet:
https://docs.google.com/spreadsheets/d/1UpMNN4Qj1Ty9pYiQ4ZBspq2fef2qTeybHZMDc35Que8/edit#gid=1956366526
Spot checking some.
* some "review found at ..." are matching "received/sent for review"
at PNAS. Probably should not count "review" if preceded by "for"
Need to write a script to remove any review docs from the raw input before
we do sdSplitByJournal.py.
Jan 28, 2019
After further clean up of checkRevPaper.py - in particular adding a bunch
of exceptions where "review" does not mean a review paper,
(Biochim_Biophys_Acta and PNAS were worst offenders) we get:
Papers examined: 53943
Marked review in pubmed but not MGI: 2128
Appears review via text but not MGI: 1672
Appears review via text AND in MGI: 0
Appears review via text AND in pubmed: 1382
So this is 2418 additional "review" papers found, ~5% of the sample set.
Could be significant. (maybe?)
NOTE in 5785 articles, we couldn't find the end of the abstract, so we did not
search for "review". So there could be a chunk of review papers here.
Would have to work harder to see if any could be gleaned from this set.
Wrote rmReviews.py to read in the review predictions file and use it
to remove review papers from sample data files
Removes ~2400 review papers from full sample set
Running RF on FULL sample set:
Recall for papers selected by each curation group
ap_status selected papers: 1710 predicted keep: 1553 recall: 0.908
gxd_status selected papers: 183 predicted keep: 177 recall: 0.967
go_status selected papers: 1755 predicted keep: 1576 recall: 0.898
tumor_status selected papers: 147 predicted keep: 115 recall: 0.782
qtl_status selected papers: 3 predicted keep: 1 recall: 0.333
Totals selected papers: 2127 predicted keep: 1849 recall: 0.869
2019/01/14-14-07-40 PRF2,F1 0.82 0.87 0.86 0.85 RF.py
Running RF on review removed sample set:
Recall for papers selected by each curation group
ap_status selected papers: 1635 predicted keep: 1476 recall: 0.903
gxd_status selected papers: 183 predicted keep: 170 recall: 0.929
go_status selected papers: 1747 predicted keep: 1541 recall: 0.882
tumor_status selected papers: 152 predicted keep: 116 recall: 0.763
qtl_status selected papers: 3 predicted keep: 2 recall: 0.667
Totals selected papers: 2111 predicted keep: 1818 recall: 0.861
2019/01/28-17-22-41 PRF2,F1 0.83 0.86 0.85 0.84 RF.py
SO all recalls and other metrics dropped. Doesn't seem to help.
Might need to retune RF params
Jan 29, 2019
SGDlog on full sample set (from Dec 19)
Recall for papers selected by each curation group (FOR TEST SET PREDS)
ap_status selected papers: 1664 predicted keep: 1592 recall: 0.957
gxd_status selected papers: 163 predicted keep: 160 recall: 0.982
go_status selected papers: 1789 predicted keep: 1696 recall: 0.948
tumor_status selected papers: 187 predicted keep: 142 recall: 0.759
qtl_status selected papers: 4 predicted keep: 4 recall: 1.000
Totals selected papers: 2153 predicted keep: 1989 recall: 0.924
2018/12/19-17-07-12 PRF2,F1 0.63 0.92 0.84 0.75 SGDlog.py
Running SGDlog on the review removed sample set:
Recall for papers selected by each curation group
ap_status selected papers: 1635 predicted keep: 1546 recall: 0.946
gxd_status selected papers: 183 predicted keep: 179 recall: 0.978
go_status selected papers: 1747 predicted keep: 1626 recall: 0.931
tumor_status selected papers: 152 predicted keep: 120 recall: 0.789
qtl_status selected papers: 3 predicted keep: 2 recall: 0.667
Totals selected papers: 2111 predicted keep: 1937 recall: 0.918
2019/01/29-08-48-28 PRF2,F1 0.64 0.92 0.85 0.76 SGDlog.py
SO slight overall improvement, a little tumor recall improvement, but
others dropped a tad.
SO: do I try to improve the review detection code?
Ideas:
* (DONE) if we cannot find end of abstract text, just use
len(title) + len(abstract) + N as the area to search for "review"
words (probably the most bang for the buck, as ~5800 samples don't find
the end of the abstract).
* (DONE) add "commentary" to list of review words (seems PNAS
uses this word).
= NEED to EVALUATE THIS MORE
* after discovering a "review" word exception, keep going to see if
we find a true "review" word later
Jan 30, 2019
Spent some time evaluating "commentary" as a review word.
Generally, looks good. There are a few exceptions to program in
"see commentary".
Asked Jackie and Debbie if they think 'commentary' is worth excluding.
Added find/remove reviews functionality to sdBuild1Get.sh
Feb 4, 2019
Cleaned up and commented featureTransform.py
Decided I should verify my gut assumption that working with the full
extracted text (instead of just fig legend text) would be too hard.
(vectorizerPlay.py)
I tried just vectorizing the training set (40803 docs) - full extracted
text. I thought it would run out of memory, but it didn't (actual training
might).
Took 1 hour, 24k features
So if I tried to tune using the full text, it would take multiple hours per
run.
Just title+abs+figure legends:
Took 4 minutes (or less?) and 3542 features
Changed figureText.py to include *any* paragraph that includes "figure" or
"table". This increases the previous "only figure legend" text by about
3.5 times. The full text is 2.5 times bigger.
In Data/jan2/Leg_para/Proc1
Vectorizing: 9357 features, 16 minutes
Feb 5, 2019
Title+abs+figure paragraphs:
preprocessed -p removeURLsCleanStem
running SGDlog.py on it
Test keep 0.75 0.87 0.81 2127
improved precision around 10 points. Recall down a bit, but haven't tuned
Tumor recall still around 78
To keep all the versions of sample files straight, came up with a new
dir structure. Need to change sdBuild* scripts to conform
NEXT step:
change figuretext.py to be a script itself with options (just legends,
legends + paragraphs, legends + words)
change sdBuild3Fig.sh to use the new figuretext.py instead of
preprocess
(basically separating the concepts of "informative text extraction"
from preprocessing)
SO have levels:
sample article subset (raw or raw_no_reviews)
what about no "mice" - depends on ref section removal
splitting into test/train/validation sets
informative text extraction
figure text
could include ref section removal
text preprocessing (featuretransform, ...)
(have to ponder. the boundaries get murky)
Idea:
remove MGI or pubmed "reviews" from sample set
try figure/table legends + words in paragraphs (not whole
paragraphs)
fix "cell line" transformation
tune from there
Currently:
refactoring figureText.py to support figure legends, figure
paragraphs, and figure words
Refactoring Ideas:
* trying different preprocessing options is not configurable,
e.g., different fig text extraction algorithms - changing which
extraction algorithm to call requires code changes.
- should write a sampleDataLib builder that would be given
some option (maybe from config or cmd line) and instantiate
Samples from that Builder, which injects the correct object/
function.
- would require a fair amount of refactoring, and should think
about this pattern in other places
- postponing for now.
Feb 17, 2019
Idea: recall for Tumor is bad because the number of Tumor papers in the
training set is too small, it is swamped by the other samples (and
maybe these papers are just harder to recognize).
So try adding an additional set of papers selected for Tumor to
the "keep_before" set. Try 1000 or so. See what happens. - changes
to sdGetRaw.py
Here are counts before these changes:
Hitting database bhmgidevdb01.jax.org prod as mgd_public
Sun Feb 17 16:44:28 2019
1392 Omitted references (only pm2gene indexed)
31163 Discard after 10/31/2017
21148 Keep after 10/31/2017
7792 Keep 10/01/2016 through 10/31/2017
Total time: 119.576 seconds
Here are updated numbers w/ Tumor papers added
Hitting database bhmgidevdb01.jax.org prod as mgd_public
Mon Feb 18 08:32:59 2019
1392 Omitted references (only pm2gene indexed)
31163 Discard after 10/31/2017
21148 Keep after 10/31/2017
8951 Keep 10/01/2016 through 10/31/2017
+ tumor papers from 7/1/2015
Total time: 34.168 seconds
Recall for papers selected by each curation group
ap_status selected papers: 1720 predicted keep: 1561 recall: 0.908
gxd_status selected papers: 176 predicted keep: 167 recall: 0.949
go_status selected papers: 1784 predicted keep: 1579 recall: 0.885
tumor_status selected papers: 142 predicted keep: 112 recall: 0.789
qtl_status selected papers: 3 predicted keep: 2 recall: 0.667
Totals selected papers: 2138 predicted keep: 1856 recall: 0.868
So this improves recall a bit.
Feb 18, 2019
added tumor papers back to 7/1/2013.
Built cleanest dataset to date:
feb18_nopmRevs (pubmed and MGI review papers removed)
With add'l tumor papers, get recall in upper 80's. Yay!
Sub directories:
Legends - just legends
LegendsWords - legends + 50 words around references to fig/tables
LegendsPara - legends + paragraphs containing refs to fig/tables
Now we can compare the different extracted text approaches.
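A toy sketch of the "LegendsWords" extraction idea (the regexes and window
size here are illustrative; figureText.py's real logic differs):

    import re

    FIG_REF = re.compile(r"\b(figs?|figures?|tables?)\b", re.IGNORECASE)
    LEGEND_START = re.compile(r"\s*(supplemental\s+)?(fig(ure)?|table)\b",
                              re.IGNORECASE)

    def legendsAndWords(paragraphs, window=50):
        kept = []
        for para in paragraphs:
            if LEGEND_START.match(para):        # whole legend paragraph
                kept.append(para)
            else:                               # words around fig/table refs
                words = para.split()
                for i, w in enumerate(words):
                    if FIG_REF.search(w):
                        kept.append(" ".join(
                            words[max(0, i - window): i + window + 1]))
        return kept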
To make it easier/quicker to add additional sample sets/files
(e.g., keep_tumor), and have other files preprocessed, I rejiggered the
build scripts to do fig text extraction and preprocessing BEFORE the
test/val/train split. (That way discard_after, keep_after, etc. can be
preprocessed once and reused as other files are added.)
Data/
apr22/ # Raw refs + text pulled from database
discard_after
keep_after
keep_before
keep_tumor
refStatuses.txt # article statuses per curation group
getRaw.log # log from getBuild1GetRaw.py
Legends/ # Informative text extraction option:
# keep title + abstract + figure legends
discard_after
keep_after
keep_before
keep_tumor
Proc1/ # Preprocessed and randomly split files
discard_after
keep_after
keep_before
keep_tumor
testSet.txt # the randomly split test set
trainSet.txt # ...training set
valSet.txt # ...validation set
splitTest.log # log from the random split process
LeftoversTest.txt # leftovers in the split process
LeftoversVal.txt # leftovers in the split process
Proc2/ # if want to try different preprocessor steps
...
LegendsWords/ # Informative text extraction option:
# keep title + abs + legends + words around
# references to figures
...
LegendsPara/ # Informative text extraction option:
# keep title + abs + legends + paragraphs
# containing references to figures
...
April 22, 2019
Long hiatus. Got sidetracked by various things, writing annual reviews,
working on extracted text splitting, vacation, ...
Updated sdBuild* to get keep_tumor references - a set of additional tumor
papers to add to the training set since tumor papers seem to be hard to
recognize.
While running sdFindReviews.py from sdBuild1GetRaw.sh to see if there are
any papers marked as "review" in pubmed, but not in MGI, got problem:
kept getting
Failed to reach server, reason: Bad Gateway
URL: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi'
Also got:
Failed to reach server, reason: Internal Server Error
URL: 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?&api_key=93420e6fa0a8dcf419d7a62e185706572e08&webenv=NCID_1_203959158_130.14.22.76_9001_1556025942_2112099563_0MetA0_S_MegaStore&query_key=1&db=pubmed&retmode=json&version=2.0&retmax=500'
when calling eulib.getPostResults() with a batch of pubmed IDs.
totally intermittent and after a random number of batches.
I shifted around the try..except stmts to retry the batch up to 5 times.
This worked, but only found 35 more reviews from pubmed to remove.
NOTE this does show that Pubmed does go back and update the "review"
status of papers sometimes after a paper has been in Pubmed for a while.
Not sure this additional "reviews" check is worth it since Lori has
already updated review papers in MGI from pubmed, but I wanted to
check this. If I find this is not worth it, I can skip these steps in
sdBuild1GetRaw.sh.
April 24, 2019
Took a while, but finally built training, validation, test sets for data
from apr22:
Mon Apr 22 14:15:58 2019
1471 Omitted references (only pm2gene indexed)
31311 Discard after 10/31/2017
23342 Keep after 10/31/2017
7672 Keep 10/01/2016 through 10/31/2017
2648 Tumor papers 07/01/2013 through 10/01/2016
Sample set breakdown (from Legends preprocessor)
: Samples Positive Negative % Positive
Validation Set : 8834 3434 5400 39%
Training Set : 49467 27624 21843 56%
Test Set : 6637 2599 4038 39%
Legends: training set file size: 329,246,557
LegendsWords: training set file size: 851,634,837
LegendsPara: training set file size: 1,115,810,384
Updated sdBuild3Pre.sh to run preprocessor steps in parallel.
This sped things up.
April 25, 2019
Tried the above sample sets on SGDlog and RF.
LegendsWords seem to come out slightly better than Legends and LegendsPara
RF comes out really well:
LegendsWords, min sample leaf 25, n estimators 50
2019/04/24-14-07-53
### Metrics: Training Set
precision recall f1-score support
Train keep 0.91 0.87 0.89 27648
Train discard 0.85 0.89 0.87 21819
avg / total 0.88 0.88 0.88 49467
Train F2: 0.87981 (keep)
['yes', 'no']
[[24118 3530]
[ 2353 19466]]
### Metrics: Validation Set
precision recall f1-score support
Valid keep 0.85 0.89 0.87 3402
Valid discard 0.93 0.90 0.91 5432
avg / total 0.90 0.89 0.89 8834
Valid F2: 0.87884 (keep)
['yes', 'no']
[[3019 383]
[ 549 4883]]
### Metrics: Test Set
precision recall f1-score support
Test keep 0.84 0.89 0.86 2607
Test discard 0.93 0.89 0.91 4030
avg / total 0.89 0.89 0.89 6637
Test F2: 0.87930 (keep)
Recall for papers selected by each curation group
ap_status selected papers: 2142 predicted keep: 1980 recall: 0.924
gxd_status selected papers: 207 predicted keep: 194 recall: 0.937
go_status selected papers: 2179 predicted keep: 1974 recall: 0.906
tumor_status selected papers: 187 predicted keep: 166 recall: 0.888
qtl_status selected papers: 4 predicted keep: 2 recall: 0.500
Totals selected papers: 2607 predicted keep: 2321 recall: 0.890
Tried several different random seeds and several different train/test
splits on LegendsWords, and got consistent results.
April 29, 2019
(1) Updated sdGetRaw.py et al. to have the option of NOT excluding review
papers and not excluding non-peer-reviewed papers. Want to see how training
and testing work on this set since, when we get PDFs, we might not have
them categorized by these criteria before we want to do automated tagging.
How does training/testing compare with what we are getting currently
WITH these restrictions?
I populated Data/apr29_norestrict/LegendsWords/Proc1 with this data
set and preprocessed the extracted text. Still need to do the split
and try the training/testing.
Ok, it did hurt recall, primarily for tumor:
Recall for papers selected by each curation group
ap_status selected papers: 2166 predicted keep: 1988 recall: 0.918
gxd_status selected papers: 178 predicted keep: 172 recall: 0.966
go_status selected papers: 2188 predicted keep: 1967 recall: 0.899
tumor_status selected papers: 217 predicted keep: 168 recall: 0.774
qtl_status selected papers: 4 predicted keep: 3 recall: 0.750
Totals selected papers: 2676 predicted keep: 2331 recall: 0.871
Need to think about this more??? (Conclusion: dropping these restrictions
hurts tumor recall significantly, so we should continue to restrict our
sample set by leaving review and non-peer-reviewed papers out.)
(2) Looked at groupRecall.py to see about computing recall by journal
(and can do precision too for journals). This would be good to see if
there are any outlier journals we'd need to handle specially somehow.
To do this by journal, I added the journal as a field in the refStatus.txt
output file (changed sdGetStatus.py and groupRecall.py (where it
reads the status file)).
Next step is to change/replace groupRecall to compute this by journal
- might want to change some variable names here too,
I found the code confusing: "true positive" has two different meanings
May 16, 2019
Finished subsetPandR.py - a more general version of groupRecall.py. Allows
computation of recall/precision for papers by journal and by curation group.
(using april 22, legends+words data, RF
subsetPandR.py --journal ../Data/apr22/refStatuses.txt RF.Apr22/RF_test_pred.tx)
Looking at journals, there are a couple of outliers to investigate:
cancer lett, arch biochem biophys, am j physiol cell physiol
May 17, 2019
Met w/ Joel, Jackie, and Richard yesterday to review where we are and what
next.
Summary of thoughts (roughly in order of difficulty):
1) Look at FN from journals with lower recall and see if there is anything
obvious going on (e.g., text extraction/finding figures is not good)
2) Pull together a small subset of FP and FN for some curators to review
3) Analyze confidence of predictions from SGDlog. I think we can automate
this analysis: Bin papers by confidence score and count numbers of
FP and FN per bin. See if higher confidence results in fewer FP/N.
Can try the same thing for RF, but need to see if I can get conf
scores from this model. (See the binning sketch after this list.)
IF we find that confidence matters, during the curation process,
curators could look at low confidence discards to find FN's.
(Note, curators would not need to do anything special with FP assuming
they look at all predicted keepers anyway)
4) Look at new approach: one predictor for each curation group. Takes a
mouse paper and predicts relevance for that group, setting status to
routed or rejected.
Effectively, if rejected for all groups, it is a discard.
Need to ponder this to see how easy it would be to do.
Curator question: If this works, would we even need a discard flag
anymore? I.e., is discard really equivalent to reject by all groups?
Isn't discard flag (currently) only necessary so 1ary triage can
inform 2ary triage?
5) Can we get an estimate of curators' recall?
We mentioned the idea of curators looking at predicted FN from these
models, but that would be starting with FN from the predictor.
Curators could have a FN from ANY curated discard, independent of
what the model predicts.
It seems to me we'd really need for curators to look at a subset
of discards from the db and evaluate if they really should be discards.
Not sure we could compute recall from this because we'd need an
estimate of the curator TP too, but we could estimate a FN rate.
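Here is a sketch of the confidence-binning analysis from idea (3) above (for
SGDlog, predict_proba() on the pipeline would supply P(keep); RF classifiers
also provide predict_proba via per-tree vote averaging; the array inputs and
0.5 threshold are assumptions):

    import numpy as np

    def confidenceBins(y_true, keep_proba, n_bins=10):
        """y_true: numpy array, 1=keep, 0=discard;
           keep_proba: numpy array of P(keep) per paper"""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        for lo, hi in zip(edges[:-1], edges[1:]):
            mask = (keep_proba >= lo) & (keep_proba < hi)
            predKeep = keep_proba[mask] >= 0.5
            fp = int(np.sum(predKeep & (y_true[mask] == 0)))
            fn = int(np.sum(~predKeep & (y_true[mask] == 1)))
            print("%.1f-%.1f  n=%5d  FP=%4d  FN=%4d"
                  % (lo, hi, mask.sum(), fp, fn))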
May 28, 2019
looking at PMID 28701356 from Am J Physiol Cell Physiol to see if its text
was extracted ok and/or figure text extraction worked ok.
Finding some bugs in figureText.py.
1) in figParagraphIterator(), it doesn't find the start of a para
if the previous paragraph was only one char (text from tables
and such in extracted text can be single characters).
This causes it to miss a figure legend and instead think it is a
para that references a figure - not too serious a bug, but
some of the figure legend may be omitted.
2) the matching term "fig" or "figure", etc. is omitted from the text,
making it hard to find things in the extracted text
BUT it seems like the text is extracted reasonably and most of the figure
text is found and pulled out.
I looked at extracted text from 28835433 28701356 30156861 30110564
all looks reasonable.
Recall is probably low because there are so few Keeps: 5 FN out of 8
keepers in the test set.
So I don't see any structural problem.
May 30, 2019
Link to graph showing P & R by journal:
https://docs.google.com/spreadsheets/d/1MQmKSkqv3rOhrD3Xxjk2uLchQx1BU25IebBwAnlLjuw/edit?pli=1#gid=3923920
Most of the journals w/ low R have very few TP and FN (in particular), so
their recall scores don't mean much.
The journal w/ R < .8 and the biggest FN is j_biol_chem. I'll look at this.
June 3, 2019
Finished cleaning up figureText.py to address the issues above.
Wrote text2figureText.py to extract fig text from a file.
Wrote sampleFile2Ref.py so it would be easier to look at the extracted text
from sample files.
FINALLY back to looking at FN: J_Biol_Chem is the journal with poorer
R and not just a few FN. Here is what I found:
1) sometimes paragraph boundaries do not appear in the extracted text, so
figure "paragraphs" can get big. Figure "words" will also pick up
words across paragraph boundaries, and figure legend starts may not be
recognized as legends