# Notes

## General Thoughts

I find it better to make 'subfunctions'; this allows for fine-tuning of output, as well as troubleshooting of specific steps.

In the 'real world', you have to be careful about malicious injections or horrifically invalid inputs. I typically make a single function to get input and verify it. I'm not nearly as rigid here as I would be otherwise, but I still have some validation for user inputs.
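
As a small illustration of that kind of check, here is a sketch of validating a date string before it reaches the query. The helper name `parse_query_date` is an assumption for this example only; it relies on nothing beyond `datetime.strptime`.

```python
from datetime import datetime


def parse_query_date(text, fmt='%Y/%m/%d'):
    """Return a date if `text` matches `fmt`, otherwise None."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except (ValueError, AttributeError):
        # ValueError: wrong format or impossible date
        # AttributeError: `text` was not a string (e.g. None)
        return None
```

Returning `None` rather than raising lets the caller decide whether to re-prompt or abort.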

Since we are querying a database we don't control, we also have to account for issues and variations in the database itself. PubMed unfortunately has some information in multiple potential fields.

## Problem Components

Components of Problem:
-   Query NCBI database - can use Entrez from Biopython to interact with the API
    -   Requires an Email **(INPUT)**
    -   Requires search query **(INPUT)**
        -   Per instructions, query consists of a keyword and date range **(INPUT)**
-   Minimum information for each article
    -   Title
    -   Abstract
    -   Publication Date
    -   Authors
        - In next step, will need to be able to query authors
-   Return information in a CSV file **(OUTPUT)**

**Looking ahead to Step 2 of project:**

Database Needs:
-   **PMID** is likely to be a sufficient `key` for the SQL database of articles
-   Need to be able to query by **author name**
-   The relationship between authors and articles is MANY TO MANY (authors can have more than one publication, and publications can have more than one author)
-   Two cross-linked tables are required

Author Data Table:
-   Should separate first and last name - can either search one field, or the joined version in SQL querying
    -   This should account for people with multiple names in a single field, or with a missing value in one field
-   Generate an `Author ID (key)` for each unique combination as an arbitrary numbering (?)

Article Data Table:
-   Required:
    -   `PMID (key)` - appears to be integers, but will need to verify from CSV file
    -   Title - strings, which can contain some special characters (spaces, commas, colons, semicolons, etc)
    -   Abstract - VERY LONG strings (paragraphs), which can contain some special characters
    -   Publication Date - Date/Time, or we can pass it as a string
-   Considerations:
    -   Number of authors? This could simplify querying, and it is unique to each paper
    -   Additional information, such as Journal Name, Journal Abbreviation, Language, Volume, Issue, Keywords, DOI, Pages

Separate Table matching Authors and Papers:
-   Combination key of `PMID` and `Author ID`
-   Can also contain a TRUE/FALSE for 'First Author'
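
The three-table layout above could be sketched with Python's built-in `sqlite3` module. All table and column names here (`articles`, `authors`, `article_authors`, `first_author`) and the helper `articles_by_author` are assumptions for illustration, not a settled schema, and the inserted row is placeholder data.

```python
import sqlite3

# In-memory database; every table and column name is hypothetical.
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE articles (
    pmid      INTEGER PRIMARY KEY,
    title     TEXT,
    abstract  TEXT,
    pub_date  TEXT
);
CREATE TABLE authors (
    author_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    first_name TEXT NOT NULL DEFAULT '',
    last_name  TEXT NOT NULL DEFAULT '',
    UNIQUE (first_name, last_name)
);
CREATE TABLE article_authors (
    pmid         INTEGER NOT NULL REFERENCES articles (pmid),
    author_id    INTEGER NOT NULL REFERENCES authors (author_id),
    first_author INTEGER NOT NULL DEFAULT 0,  -- 1 = TRUE, 0 = FALSE
    PRIMARY KEY (pmid, author_id)
);
""")


def articles_by_author(conn, last_name):
    """Return (pmid, title) rows for every article by a given last name."""
    query = """
        SELECT a.pmid, a.title
        FROM articles AS a
        JOIN article_authors AS aa ON aa.pmid = a.pmid
        JOIN authors AS au ON au.author_id = aa.author_id
        WHERE au.last_name = ?
        ORDER BY a.pmid
    """
    return conn.execute(query, (last_name,)).fetchall()


# Placeholder rows, only to show the link table in use
conn.execute("INSERT INTO articles VALUES (1, 'example title', '', '2020/01/01')")
conn.execute("INSERT INTO authors (first_name, last_name) VALUES ('jane', 'doe')")
conn.execute("INSERT INTO article_authors VALUES (1, 1, 1)")
```

The composite primary key on `(pmid, author_id)` keeps the same author from being linked to the same article twice.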

## Structure of a PubMed Record

In [254]:
Entrez.email = 'apatheticgraffiti@gmail.com'
try:
    handle = Entrez.efetch(db = 'pubmed',
                            retmode = 'xml',
                            id = 32019336
                            )
                        
    results = Entrez.read(handle)
    handle.close()
    test_record = results
except Exception as e:
    message = ' '.join(['An error occurred when contacting NCBI servers.',
                        'Check your query terms. Consider reattempting',
                        'outside of peak hours. Message: \n',
                        f'{e}'])
    print(message)

-   XML records act as Dictionary Objects with nested components

In [255]:
type(test_record)

Bio.Entrez.Parser.DictionaryElement

-   'PubmedBookArticle' (irrelevant to our project)
-   'PubmedArticle'
    -   List of ['MedlineCitation' (Dictionary Object), 'PubmedData' (Dictionary Object)] items, for each ID used in query; can be iterated over
    -   'PubmedData' contains the references, history, publication status, and article IDs associated with the article; none of this is useful to us
    -   'MedlineCitation' contains ['KeywordList', 'GeneralNote', 'SpaceFlightMission', 'CitationSubset', 'OtherID', 'OtherAbstract', 'PMID', 'DateCompleted', 'DateRevised', 'Article', 'MedlineJournalInfo', 'ChemicalList', 'MeshHeadingList']. This is all the information we should require.

As such, we can select `['PubmedArticle'][*idindexinlist*]['MedlineCitation']` for all records

-   **'KeywordList'** is either empty (len = 0) or contains one or more lists of keywords
    -   Can join the first list into a string with `', '.join(record['PubmedArticle'][0]['MedlineCitation']['KeywordList'][0])`
-   'GeneralNote', 'SpaceFlightMission', 'CitationSubset', 'OtherID', and 'OtherAbstract' are irrelevant to our project
-   **'PMID'** contains the PMID used in query as a special String element
    -   Can be extracted with `str(record['PubmedArticle'][0]['MedlineCitation']['PMID'])`
-   'DateCompleted' is a dictionary with keys ['Year', 'Month', 'Day']; may be useful if no other dates found
-   'DateRevised' is a dictionary with ['Year', 'Month', 'Day']; but likely irrelevant to the project
-   **'Article'** is a dictionary object with keys ['ArticleDate', 'Language', 'ELocationID', 'Journal', 'ArticleTitle', 'Pagination', 'Abstract', 'AuthorList', 'GrantList', 'PublicationTypeList']
    -   **'ArticleDate'** is a list of Dictionary Objects with keys ['Year', 'Month', 'Day'], used as the **fallback publication date** when the journal 'PubDate' is unavailable. It may be missing components or missing entirely, and may give Months as 3-character strings (e.g. 'Feb' instead of '02').
    -   **'Language'** gives a list element, with the **language of the article in standard abbreviation** (e.g., English is 'eng')
        -  Can be extracted with `', '.join(record['PubmedArticle'][*index*]['MedlineCitation']['Article']['Language'])`
    -   **'ELocationID'** is a list of string elements with attributes
        -  Can identify a valid **DOI** with `record['PubmedArticle'][*index*]['MedlineCitation']['Article']['ELocationID'][*index*].attributes['EIdType'] == 'doi' and record['PubmedArticle'][*index*]['MedlineCitation']['Article']['ELocationID'][*index*].attributes['ValidYN'] == 'Y'`; extract with `str(record['PubmedArticle'][*index*]['MedlineCitation']['Article']['ELocationID'][*index*])`
    -   'Journal' is a dictionary with keys ['ISSN', 'JournalIssue', 'ISOAbbreviation']
        -   'ISSN' is likely irrelevant to our project
        -   'JournalIssue' is a dictionary with keys ['Volume', 'Issue', 'PubDate']
            -   'Volume' gives the volume, if it exists at all
            -   'Issue' gives the issue, if it exists at all
            -   'PubDate' is a dictionary with keys ['Year', 'Month', 'Day'], with similar problems to 'DateCompleted' above - this is the preferred DATE, but may be missing components or otherwise absent
        -   **'ISOAbbreviation'** gives the journal name's **standardized abbreviation**
    -   **'ArticleTitle'** gives the title as a string
    -   **'Pagination'** is a dictionary with keys ['StartPage', 'EndPage', 'MedlinePgn']
        -   All are strings, but may be absent or otherwise missing; 'StartPage' and 'EndPage' are preferred, with extraction from 'MedlinePgn' if absent
    -   **'Abstract'** is a dictionary with key ['AbstractText'], which is a list of strings.
        -  Can be extracted with `' '.join(record['PubmedArticle'][*index*]['MedlineCitation']['Article']['Abstract']['AbstractText'])`
    -   **'AuthorList'** is a list of dictionaries with keys ['AffiliationInfo', 'Identifier', 'LastName', 'ForeName', 'Initials']
        -   'AffiliationInfo' is irrelevant for our project
        -   'Identifier' is usually blank
        -   'LastName' is last names
        -   'ForeName' is first names, which may be missing
        -   'Initials' are the initials of the given name(s), which may be missing
    -   'GrantList' and 'PublicationTypeList' are irrelevant for the project
-   'MedlineJournalInfo' is a dictionary containing keys ['Country', 'MedlineTA', 'NlmUniqueID']; all are irrelevant to the project
-   'ChemicalList' is a list of dictionaries with keys ['RegistryNumber', 'NameOfSubstance']; all are irrelevant to our project
-   'MeshHeadingList' is a list of dictionaries with keys ['QualifierName', 'DescriptorName']; all are irrelevant to our project
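
The DOI check described under 'ELocationID' can be sketched as a small helper. This assumes only what the notes above state: each element behaves like a string and carries an `.attributes` dictionary with 'EIdType' and 'ValidYN'. The names `first_valid_doi` and `FakeElement` are hypothetical; `FakeElement` is a stand-in so the sketch runs without contacting NCBI, and the sample identifiers are placeholders.

```python
def first_valid_doi(elocation_ids):
    """Return the first valid DOI in an ELocationID list, or ''."""
    for element in elocation_ids:
        attrs = getattr(element, 'attributes', {})
        if attrs.get('EIdType') == 'doi' and attrs.get('ValidYN') == 'Y':
            return str(element).strip()
    return ''


class FakeElement(str):
    """Stand-in for a parsed string element that carries attributes."""

    def __new__(cls, text, attributes):
        obj = super().__new__(cls, text)
        obj.attributes = attributes
        return obj


# Placeholder identifiers, not taken from a real record
sample = [
    FakeElement('S0000-0000(00)00000-0', {'EIdType': 'pii', 'ValidYN': 'Y'}),
    FakeElement('10.0000/example', {'EIdType': 'doi', 'ValidYN': 'Y'}),
]
```

Using `.get()` on the attributes means an element without 'ValidYN' is skipped instead of raising a `KeyError`.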


In [256]:
test_record['PubmedArticle'][0]['MedlineCitation']['KeywordList'][0]

ListElement([StringElement('CXCR4', attributes={'MajorTopicYN': 'N'}), StringElement('HIV-1 entry inhibitor', attributes={'MajorTopicYN': 'N'}), StringElement('V3 loop', attributes={'MajorTopicYN': 'N'}), StringElement('X4-tropic', attributes={'MajorTopicYN': 'N'}), StringElement('gp120', attributes={'MajorTopicYN': 'N'})], attributes={'Owner': 'NOTNLM'})

## Plan for Code

### Plan for Extraction

Top level content:
-   PMID from ['PMID']
    - otherwise use query ID if a record was returned at all
-   Keywords from joining items in ['KeywordList'] if they exist with length > 0
    - otherwise empty

Content from ['Article']:
-   Title from ['Article']['ArticleTitle']
-   Abstract from joining ['Article']['Abstract']['AbstractText'] if ['Article']['Abstract']['AbstractText'] exists and has a length > 0
    - otherwise empty
-   Authors from carefully joining elements of ['Article']['AuthorList']
    - FIRST, LAST, INITIALS order, with items blank if entirely missing. If ForeName is missing, use Initials
-   Date from ['Article']['ArticleDate'] in Y/M/D format 
    - Missing components default to the earliest value, so a missing month becomes January, a missing day becomes the first, etc.
    - If missing, try date from ['Journal']
-   Language from joining ['Article']['Language'] if ['Article']['Language'] exists and has a length > 0
    - otherwise empty
-   Pagination from ['Article']['Pagination'] if it exists and has length > 0
    - First and Last as separate; use the combined version if those are absent
    - otherwise blank
-   Content from ['Article']['Journal']
    -   Abbreviation from ['Article']['Journal']['ISOAbbreviation'] if it exists and has length > 0
    -   Journal name in ['Article']['Journal']['Title'] if it exists and has a length > 0 
    -   Issue from ['Article']['Journal']['JournalIssue']['Issue'] if it exists and has length > 0
    -   Volume from ['Article']['Journal']['JournalIssue']['Volume'] if it exists and has length > 0
    -   Date from ['Article']['Journal']['JournalIssue']['PubDate'], preferred over ['Article']['ArticleDate'] - same caveats
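
Almost every item in this plan follows the same "if it exists and has length > 0, otherwise empty" pattern. A minimal sketch of factoring that into one helper is below; `get_text` is a hypothetical name and is not used by the functions later in these notes.

```python
def get_text(record, *keys, default=''):
    """Walk nested keys; return stripped text, or `default` if missing/empty."""
    current = record
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # Key absent, index out of range, or not a container at all
            return default
    text = str(current).strip()
    return text if text else default


# Plain dictionaries stand in for a parsed record here
example = {'Journal': {'JournalIssue': {'Volume': ' 12 ', 'Issue': ''}}}
```

For instance, `get_text(example, 'Journal', 'JournalIssue', 'Volume')` walks three levels down and strips the surrounding whitespace, while a missing 'Title' simply yields the default.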


## Plan for output format

Overall needs:
-   Establish a format for creating this CSV where each article is only one line
-   Delimiter of CSV needs to be unique enough that it won't cause issues in the text fields with special characters (Title, Abstract)
-   Need 1 row for each article, despite multiple authors: need to create a list of authors with two delimiters (between first/last field, and between authors)

CSV Delimiter: ','

Author Delimiters: '+' between authors, ';;' between fields in (first;;last;;initials) order, matching `format_author()`

-   Each line is a single row
-   No header

PMID, PUBLICATIONDATE, TITLE, JOURNALNAME, JOURNALABBREVIATION, VOL, ISSUE, KEYWORDS, AUTHORS[FIRST;;LAST;;INITIALS+FIRST;;LAST;;INITIALS], LANGUAGE, FIRSTPAGE, LASTPAGE, ABSTRACT
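
To check that these delimiter choices survive a round trip, here is a sketch that writes one shortened, made-up row with the `csv` module and splits the author field back apart. It assumes the field order that `format_author()` produces (first, last, initials); `split_authors` is a hypothetical helper, and the '|' quote character matches the default used later in `chunk_process()`.

```python
import csv
import io


def split_authors(field, between='+', within=';;'):
    """Split an AUTHORS field into one list of name parts per author."""
    if not field:
        return []
    return [author.split(within) for author in field.split(between)]


# A shortened, made-up row: PMID, date, title, authors
row = ['0', '2020/01/01', 'a title, with a comma', 'jane;;doe;;j+rich;;roe;;r']

# Write to an in-memory buffer, then read the same text back
buffer = io.StringIO()
csv.writer(buffer, delimiter=',', quotechar='|').writerow(row)
buffer.seek(0)
parsed = next(csv.reader(buffer, delimiter=',', quotechar='|'))
authors = split_authors(parsed[3])
```

The comma inside the title is protected by the quote character, so the row still parses into four fields. An author name that itself contains '+' or ';;' would still be split incorrectly, which is a limitation of this scheme.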

# Problem Solving

## Imports

In [258]:
from Bio import Entrez
from datetime import datetime as dt
import csv
import os
import re

## Step 1: Search, Retrieve IDs

In [259]:
def pubmed_search_ids(keyword, start_date, end_date, email):
    """
    Submits a query to PubMed via ENTREZ
    
    INPUTS:
        keyword (string): Keyword term
        start_date (string): Date in YYYY/MM/DD format
        end_date (string): Date in YYYY/MM/DD format
        email (string): email address, required by NCBI
    RETURNS:
        results (list): list of PMIDs as strings
        Prints error message and suggestions if error message from
        NCBI
    REQUIREMENTS/DEPENDENCIES:
        Entrez from BioPython
    """

    query = (f'({keyword}) AND ("{start_date}"[Date - Publication]'
            f' : "{end_date}"[Date - Publication])'
            )

    Entrez.email = email
    try:
        handle = Entrez.esearch(db='pubmed',
                                sort='relevance',
                                retmax = 200000,
                                retmode='xml',
                                term=query)
        results = Entrez.read(handle)
        handle.close()
        return results['IdList']
    except Exception as e:
        message = ' '.join(['An error occurred when contacting NCBI servers.',
                            'Check your query terms. Consider reattempting',
                            'outside of peak hours. Message: \n',
                            f'{e}'])
        print(message)

## Step 2: Fetch a Record

In [260]:
def fetch_details(target_ids, email):
    """
    Queries IDs from PubMed
    Automatically trims output to the Medline Citation

    INPUTS:
        target_ids (list or string): PMIDs to fetch, e.g. the list
                returned by pubmed_search_ids()
        email (string): email address, required by NCBI
    RETURNS:
        results (list): list of MedlineCitation entries in each result
        Prints error message and suggestions if error message from
        NCBI
    REQUIREMENTS/DEPENDENCIES:
        Entrez from BioPython
    """
    Entrez.email = email
    try:
        handle = Entrez.efetch(db = 'pubmed',
                            retmode = 'xml',
                            id = target_ids)
                            
        results = [result['MedlineCitation'] for result in Entrez.read(handle)['PubmedArticle']]
        handle.close()
        return results
    except Exception as e:
        message = ' '.join(['An error occurred when contacting NCBI servers.',
                            'Check your query terms. Consider reattempting',
                            'outside of peak hours. Message: \n',
                            f'{e}'])
        print(message)

## Step 3: Extract Data from a record, to make a row of output

### Extracting and Formatting Dates and Authors

In [281]:
def format_date(date_obj):
    """
    Takes a date object and returns a formatted date time object

    INPUTS:
        date_obj (dict): Dictionary, which may contain keys
                        ['Year', 'Month', 'Day'] with single
                        string content.
                        'Year' is expected to be 4 digit, if exists
                        'Month' is expected to be 2 digit or
                                3 character string, if exists
                        'Day' is expected to be 2 digit, if exists
    RETURNS:
        date (date): Date, with missing/invalid month or day
                     rounded to '01'.
                     Returns an empty string if invalid, or 
                     no year.
    REQUIREMENTS/DEPENDENCIES:
        datetime.datetime as dt
    """
    # Dictionary to translate string months
    months_dict = {
                'jan': '01', 'feb': '02', 'mar': '03', 
                'apr': '04', 'may': '05', 'jun': '06' ,
                'jul': '07', 'aug': '08', 'sep': '09', 
                'oct': '10', 'nov': '11', 'dec': '12',
                }

    # YEAR
    if 'Year' in date_obj.keys():
        year = date_obj['Year'].lower().strip()
        # Anything other than exactly four digits cannot be parsed by %Y
        if len(year) != 4 or not year.isdigit():
            year = False
    else:
        year = False
    # MONTH
    if 'Month' in date_obj.keys():
        month = date_obj['Month'].lower().strip()
        if not month.isdigit() and month in months_dict.keys():
            month = months_dict[month]
        elif month.isdigit() and int(month) in range(1, 13):
            # Accept '2' or '02'; always zero-pad to two digits
            month = f'{int(month):02d}'
        else:
            month = False
    else:
        month = False
    # DAY
    if 'Day' in date_obj.keys():
        day = date_obj['Day'].lower().strip()
        if day.isdigit() and int(day) in range(1, 32):
            # Accept '5' or '05'; always zero-pad to two digits
            day = f'{int(day):02d}'
        else:
            day = False
    else:
        day = False

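    # Format; if the day is invalid for the month (e.g. Feb 30),
    # drop the day so that it defaults to the first
    if year and month and day:
        try:
            date = dt.strptime(f'{year}/{month}/{day}', r'%Y/%m/%d').date()
        except ValueError:
            date = dt.strptime(f'{year}/{month}', r'%Y/%m').date()
    elif year and month:
        date = dt.strptime(f'{year}/{month}', r'%Y/%m').date()
    elif year and day:
        date = dt.strptime(f'{year}/{day}', r'%Y/%d').date()
    elif year:
        date = dt.strptime(f'{year}', r'%Y').date()
    else:
        date = ''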

    # Return
    return date

In [262]:
def format_author(record, delimiter = '+'):
    """
    Formats an Author from a PubMed Author List into a string
    in 'FIRST (delimiter) LAST (delimiter) INITIALS' format,
    all in lowercase

    INPUTS:
        record (dict): Single entry from a PubMed MedlineCitation 
                already sliced to ['Article']['AuthorList'].
                Expected to potentially contain the single values
                for keys ['ForeName', 'LastName', 'Initials']
        delimiter (str): delimiter between components in the 
                output string. Default is '+' 
    RETURNS:
        name_str (str): string listing of author name
                missing fields are blank. If ForeName absent,
                attempts to replace with Initials
    """
    if 'ForeName' in record.keys() and len(record['ForeName']) > 0:
        first_nm = record['ForeName'].lower().strip()
    elif 'Initials' in record.keys() and len(record['Initials']) > 0:
        first_nm = record['Initials'].lower().strip()
    else:
        first_nm = ''
    if 'Initials' in record.keys() and len(record['Initials']) > 0:
        initials = record['Initials'].lower().strip()
    else:
        initials = ''

    if 'LastName' in record.keys() and len(record['LastName']) > 0:
        last_nm = record['LastName'].lower().strip()
    else:
        last_nm = ''

    name_str = delimiter.join([first_nm,last_nm,initials])

    return name_str

### Journal Data

In [263]:
def journal_content(record, extract_date = False):
    """
    Scrapes ['Article']['Journal'] level content from
    a PubMed MedlineCitation entry:
    Journal Title, Journal ISO Abbreviation, Volume, Issue,
    and Publication Date

    INPUTS:
        record (dict): PubMed MedlineCitation already sliced to
                ['Article']['Journal']
        extract_date (bool): TRUE if extraction of date is
                desired; default is FALSE 
    RETURNS:
        jour_data (list): list of ['Journal'] data items;
                [Journal Title, Journal ISO Abbreviation,
                Volume, Issue, Publication Date]
                missing fields are empty strings
    REQUIREMENTS/DEPENDENCIES:
        format_date(): datetime.datetime as dt
    """
    # JOURNAL NAME
    if 'Title' in record.keys() and len(record['Title']) >0:
        jour_title = str(record['Title']).lower().strip()
    else:
        jour_title = ''

    # JOURNAL ABBREVIATION
    if 'ISOAbbreviation' in record.keys() and len(record['ISOAbbreviation']) >0:
        jour_abbrev = str(record['ISOAbbreviation']).lower().strip()
    else:
        jour_abbrev = ''

    # Journal Issue Items
    if 'JournalIssue' in record.keys():
    # JOURNAL ISSUE
        if 'Issue' in record['JournalIssue'].keys() and len(record['JournalIssue']['Issue']) >0:
            jour_issue = str(record['JournalIssue']['Issue']).lower().strip()
        else:
            jour_issue = ''
    # JOURNAL VOLUME
        if 'Volume' in record['JournalIssue'].keys() and len(record['JournalIssue']['Volume']) >0:
            jour_vol = str(record['JournalIssue']['Volume']).lower().strip()
        else:
            jour_vol = ''
    # JOURNAL DATE
        if extract_date:
            if 'PubDate' in record['JournalIssue'].keys() and len(record['JournalIssue']['PubDate']) >0:
                jour_date = format_date(record['JournalIssue']['PubDate'])
            else:
                jour_date = ''
        else:
            jour_date = ''
    else:
        jour_vol = ''
        jour_issue = ''
        jour_date = ''

    jour_data = [jour_title, jour_abbrev, jour_vol, jour_issue, jour_date]

    return jour_data

### Article Content

In [264]:
def article_content(record, delimiter_btwn = '+', delimiter_within = ';;', 
                    date_format = r'%Y/%m/%d'):
    """
    Scrapes ['Article'] level content from
    a PubMed MedlineCitation entry:
    Title, Abstract, Pagination, Publication Date, Language,
    Authors, Journal Title, Journal ISO Abbreviation,
    Journal Volume, Journal Issue.
    Prefers JOURNAL PUBLICATION DATE over Article Date, if
    it is present.

    INPUTS:
        record (dict): PubMed MedlineCitation already sliced to
                ['Article']
        delimiter_btwn (str): delimiter between authors in the
                author list; default is '+'
        delimiter_within (str): delimiter between components
                of author name, used in format_author(); 
                default is ';;'
        date_format (str): string for date format using datetime
                object. Default is '%Y/%m/%d' for YYYY/MM/DD format
    RETURNS:
        article_data (list): list of ['Article'] data items;
                [Article Title, Article Abstract, Publication Date,
                Start Page, End Page, Language, Authors, Journal Title,
                Journal ISO Abbreviation, Journal Volume, Journal Issue]
                Missing fields are empty strings
    REQUIREMENTS/DEPENDENCIES:
        datetime.datetime as dt
        format_date()
        re (used for the pagination fallback)
        format_author()
        journal_content()
    """

    # TITLE
    if 'ArticleTitle' in record.keys() and len(record['ArticleTitle']) >0:
        title = str(record['ArticleTitle']).lower().strip()
    else:
        title = ''

    # ABSTRACT
    if 'Abstract' in record.keys() and len(record['Abstract']) >0:
        if 'AbstractText' in record['Abstract'].keys() and len(record['Abstract']['AbstractText']) > 0:
            abstract = ' '.join(record['Abstract']['AbstractText'])
        else:
            abstract = ''
    else:
        abstract = ''

    # DATE (IF NOT WITH JOURNAL)
    if 'ArticleDate' in record.keys() and len(record['ArticleDate']) >0:
        date = format_date(record['ArticleDate'][0])
    else:
        date = ''

    # PAGINATION
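    # The hyphen is escaped so it is a literal '-' and not a ' '-to-':'
    # character range (which would also match digits)
    pgn_regex = r'^([0-9A-Za-z]+)[ \-:/]+([0-9A-Za-z]+)$'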
    if 'Pagination' in record.keys() and len(record['Pagination']) >0:
        if 'StartPage' in record['Pagination'].keys() and len(record['Pagination']['StartPage']) > 0:
            page_start = record['Pagination']['StartPage']
        else:
            page_start = False
        if 'EndPage' in record['Pagination'].keys() and len(record['Pagination']['EndPage']) > 0:
            page_end = record['Pagination']['EndPage']
        else:
            page_end = False
        if not (page_start or page_end):
            if 'MedlinePgn' in record['Pagination'].keys() and len(record['Pagination']['MedlinePgn']) > 0:
                search = re.search(pgn_regex, record['Pagination']['MedlinePgn'])
                if search:
                    page_start = search[1]
                    page_end = search[2]
        # Replace any component that is still missing with an empty string,
        # so that False is never written to the output
        page_start = page_start or ''
        page_end = page_end or ''
    else:
        page_start = ''
        page_end = ''


    # AUTHORS
    if 'AuthorList' in record.keys() and len(record['AuthorList']) > 0:
        authors = delimiter_btwn.join([format_author(author, delimiter_within) for author in record['AuthorList']])
    else:
        authors = ''

    # LANGUAGE
    if 'Language' in record.keys() and len(record['Language']) > 0:
        lang = ', '.join(record['Language'])
    else:
        lang = ''

    # ['Journal'] CONTENT
    jour_content = journal_content(record['Journal'], extract_date = True)
    jour_title, jour_abbrev, jour_vol, jour_issue, jour_date = jour_content

    # PREFER DATE FROM JOURNAL
    if jour_date != '':
        date = jour_date
    if date != '':
        date = date.strftime(date_format)

    # Package Output

    article_data = [title, abstract, date, page_start,
                    page_end, lang, authors, jour_title,
                    jour_abbrev, jour_vol, jour_issue
                    ]
    # Return Output
    return article_data

### Single Full Record

In [265]:
def scrape_record(record, name_delim_btwn = '+', name_delim_within = ';;', 
                  date_format = r'%Y/%m/%d'):
    """
    Scrapes a MedlineCitation of a PubMed article.
    Returns article data as a list, which can become a single row 
    of a dataframe or other structure.

    INPUTS:
        record (dict): PubMed MedlineCitation
        name_delim_btwn (str): delimiter between authors in the
                author list, used in article_content(); 
                default is '+'
        name_delim_within (str): delimiter between components
                of author name, used in format_author() within
                article_content(); default is ';;'
        date_format (str): string for date format using datetime
                object in article_content(); 
                default is '%Y/%m/%d' for YYYY/MM/DD format
    RETURNS:
        content (list): list of article data;
                [PMID, Publication Date, Article Title,
                Journal Title, Journal ISO Abbreviation, 
                Journal Volume, Journal Issue, Keywords, 
                Authors, Language, Pagination (Start),
                Pagination (End), Abstract]
                Missing fields are empty strings
    REQUIREMENTS/DEPENDENCIES:
        datetime.datetime as dt
        format_date()
        format_author()
        journal_content()
        article_content(): re
    """
    # PMID
    if 'PMID' in record.keys() and len(record['PMID']) > 0:
        pmid = str(record['PMID']).lower().strip()
    else:
        pmid = ''
    # KEYWORDS
    if 'KeywordList' in record.keys() and len(record['KeywordList']) > 0:
        keywords = name_delim_btwn.join([', '.join(keywords) for keywords in record['KeywordList']])
    else:
        keywords = ''

    # ARTICLE CONTENT
    (title, abstract, date, page_start, page_end, lang, authors,
     jour_title, jour_abbrev, jour_vol, jour_issue) = article_content(
        record['Article'], name_delim_btwn, name_delim_within, date_format)

    # Organize Output into row
    content = [pmid, date, title, jour_title, jour_abbrev,
            jour_vol, jour_issue, keywords, authors,
            lang, page_start, page_end, abstract]
    
    # Return
    return content

## Step 4: Scrape Multiple Records in Chunks, Write to CSV

In [367]:
def chunk_process(keyword, start_date, end_date, email, 
                  path, chunksize = 100, date_format = r'%Y/%m/%d',
                  name_delim_btwn = '+', name_delim_within = ';;', 
                  delim = ',', quote_str = '|', overwrite = False):
    """
    Performs a search of PubMed using a date range and keyword,
    and scrapes these records into a CSV file.

    INPUTS:
        keyword (string): Keyword term
        start_date (string): Date in YYYY/MM/DD format
        end_date (string): Date in YYYY/MM/DD format
        email (string): email address, required by NCBI
        path (string,path): string address of output file path
        chunksize (int): number of records to batch process
                between opening file, used to optimize run time
                based on capabilities of an individual machine;
                default is 100
        date_format (str): string for date format using datetime
                object in article_content(); 
                default is '%Y/%m/%d' for YYYY/MM/DD format
        name_delim_btwn (str): delimiter between authors in the
                author list; 
                default is '+'
        name_delim_within (str): delimiter between components
                of author name; 
                default is ';;'
        delim (str): delimiter between values in a single row
                in the CSV output file. Must be a single
                character string;
                default is ','
        quote_str (str): character used to quote entries that
                contain characters in the delimiters;
                default is '|'
        overwrite (bool): True/False value indicating if it is
                desired to overwrite the output file, if it
                already exists;
                default is FALSE

    RETURNS:
        writes records to `path`. Prints happy message summary
        if successful. If error, prints error message and the row
        where the problem occurred.

    REQUIREMENTS/DEPENDENCIES:
        os
        csv
        re
        Entrez from BioPython
        datetime.datetime as dt
        pubmed_search_ids()
        scrape_record()
        journal_content()
        article_content()
        format_date()
        format_author()
    """  
    # Perform Search, Get IDS
    target_ids = pubmed_search_ids(keyword, start_date, end_date, email)
    # Chunk Data
    if target_ids:
        # Create (or overwrite) the file for the first chunk only;
        # every later chunk is appended
        mode = 'w' if (overwrite or not os.path.isfile(path)) else 'a'
        row = None
        try:
            for i in range(0, len(target_ids), chunksize):
                chunk_ids = target_ids[i:i+chunksize]
                records = fetch_details(chunk_ids, email)
                data = [scrape_record(record, name_delim_btwn, name_delim_within, date_format) for record in records]
                # newline='' lets the csv module handle line endings itself
                with open(path, mode, encoding = 'utf-8', newline = '') as outfile:
                    csvwriter = csv.writer(outfile, delimiter = delim, quotechar = quote_str)
                    for row in data:
                        csvwriter.writerow(row)
                mode = 'a'
            # Happy Message
            message = [f'Success! \n {len(target_ids)} records for ',
                    'PubMed Search: \n',
                    f'({keyword}) AND ("{start_date}"[Date - Publication]'
                    f' : "{end_date}"[Date - Publication]) \n',
                    f' written to {path}']
            print(''.join(message))
        except Exception as e:
            # str(e) is needed because join() only accepts strings
            message = ['An error occurred:', str(e),
                    'Row with issue:', f'{row}']
            print('\n'.join(message))

## Step 5: Wrap with Basic Input Validation

In [363]:
def scrapper(keyword = None, start_date = None, end_date = None,
                    email = None, path = None, chunksize = 100,
                    date_format = r'%Y/%m/%d',
                    name_delim_btwn = '+', name_delim_within = ';;', 
                    delim = ',', quote_str = '|', overwrite = False):
    """
    Validates inputs and runs scrapper!

    INPUTS:
        PROMPTS FOR INPUT IF MISSING OR INVALID:
            keyword (string): Keyword term
            start_date (string): Date in YYYY/MM/DD format
            end_date (string): Date in YYYY/MM/DD format
            email (string): email address, required by NCBI
            path (string,path): string address of output file path
        DOES NOT PROMPT OR VALIDATE, HAS DEFAULT:
            chunksize (int): number of records to batch process
                    between opening file, used to optimize run time
                    based on capabilities of an individual machine;
                    default is 100
            date_format (str): string for date format using datetime
                    object in article_content(); 
                    default is '%Y/%m/%d' for YYYY/MM/DD format
            name_delim_btwn (str): delimiter between authors in the
                    author list; 
                    default is '+'
            name_delim_within (str): delimiter between components
                    of author name; 
                    default is ';;'
            delim (str): delimiter between values in a single row
                    in the CSV output file. Must be a single
                    character string;
                    default is ','
            quote_str (str): character used to quote entries that
                    contain characters in the delimiters;
                    default is '|'
            overwrite (bool): True/False value indicating if it is
                    desired to overwrite the output file, if it
                    already exists;
                    default is FALSE
    RETURNS:
        writes records to `path`. Prints happy message summary
        if successful. If error, prints error message and the row
        where the problem occurred.

    REQUIREMENTS/DEPENDENCIES:
        os
        csv
        re
        Entrez from BioPython
        datetime from datetime as dt
        chunk_process()
        pubmed_search_ids()
        scrape_record()
        journal_content()
        article_content()
        format_date()
        format_author()
    """
    ########### DATA CHECKS ################
    # =======================================      
    email_regex = r"^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$"
    # Allow user to exit
    quit = False
    while not quit:
        ###### GET INPUTS IF NOT GIVEN IN FUNCTION
        # Each prompt uses `if` (not an inner `while`) so that `break`
        # leaves the outer loop as soon as the user enters 'q'. Empty or
        # invalid answers are prompted for again on the next pass.
        if not email:
            email = input("Enter email for Entrez query, or 'q' to exit: ").lower().strip()
            if email == 'q':
                quit = True
                break
        if not keyword:
            keyword = input("Enter keyword, or 'q' to exit: ").lower().strip()
            if keyword == 'q':
                quit = True
                break
        if not start_date:
            start_date = input("Enter start date (YYYY/MM/DD), or 'q' to exit: ").strip().lower()
            if start_date == 'q':
                quit = True
                break
        if not end_date:
            end_date = input("Enter end date (YYYY/MM/DD), or 'q' to exit: ").strip().lower()
            if end_date == 'q':
                quit = True
                break
        if not path:
            # The stored path is not lowercased because paths can be case sensitive
            path = input("Enter path for output file (.txt or .csv), or 'q' to exit: ").strip()
            # `path == 'q' or 'Q'` is always True, so compare a lowercased copy
            if path.lower() == 'q':
                quit = True
                break
        ###### VALIDATE
        if email:
            if not re.search(email_regex, email):
                print("Invalid email.")
                email = None
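        # Only strings need parsing; a date object left over from an
        # earlier pass of the loop has already been validated
        if start_date and isinstance(start_date, str):
            # Checks start_date format, if this fails, it's invalid
            try:
                start_date = dt.strptime(start_date, r'%Y/%m/%d').date()
            except ValueError:
                print("Invalid start date.")
                start_date = None
        if end_date and isinstance(end_date, str):
            # Checks end_date format, if this fails, it's invalid
            try:
                end_date = dt.strptime(end_date, r'%Y/%m/%d').date()
            except ValueError:
                print("Invalid end date.")
                end_date = None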
        # Ensure start_date is before end_date
        if (start_date and end_date) and start_date > end_date:
            print(f"Start Date {start_date} is after {end_date}. Please enter valid dates")
            start_date = None
            end_date = None
        # Check path is valid
        if path:
            try:
                path = os.path.abspath(path)
                if not os.path.isdir(os.path.dirname(path)):
                    print("Path directory not found.")
                    path = None
                # splitext() copes with file names that have several dots
                # or no extension at all
                elif os.path.splitext(path)[1].lower() not in ('.txt', '.csv'):
                    print("File must have a .txt or .csv extension.")
                    path = None
            except (TypeError, ValueError, OSError):
                print("Invalid path")
                path = None
        # Ensure String Format for dates
        if start_date and end_date:
            if not isinstance(start_date,str):
                start_date = start_date.strftime(r'%Y/%m/%d')
            if not isinstance(end_date, str):
                end_date = end_date.strftime(r'%Y/%m/%d')
        
        # If all valid, break out of validation cycle!
        if start_date and end_date and keyword and email and path:
            break
    if not quit:
        chunk_process(keyword, start_date, end_date, email, 
                  path, chunksize, date_format,
                  name_delim_btwn, name_delim_within, 
                  delim, quote_str, overwrite)
    else:
        print("Processing Stopped.")    
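
Because the checks above are mixed in with `input()` calls, they are hard to test on their own. The sketch below shows one possible way to pull the date and extension checks into small pure functions. `parse_date_range()` and `has_allowed_extension()` are hypothetical names introduced only for illustration; they are not part of the notebook's code.

```python
import os
from datetime import datetime as dt


def parse_date_range(start, end, fmt='%Y/%m/%d'):
    """
    Parse two date strings and return them normalized to `fmt`.
    Raises ValueError if either string does not match `fmt`
    or if the start date falls after the end date.
    """
    start_dt = dt.strptime(start.strip(), fmt).date()
    end_dt = dt.strptime(end.strip(), fmt).date()
    if start_dt > end_dt:
        raise ValueError(f'start date {start_dt} is after end date {end_dt}')
    return start_dt.strftime(fmt), end_dt.strftime(fmt)


def has_allowed_extension(path, allowed=('.txt', '.csv')):
    """True if the file name ends in one of the allowed extensions."""
    # splitext() keeps only the final extension, so 'my.file.csv' is accepted
    # and a name without any extension yields '' instead of raising.
    return os.path.splitext(path)[1].lower() in allowed


if __name__ == '__main__':
    print(parse_date_range('2020/01/01', '2020/08/01'))
    print(has_allowed_extension('outfile.txt'))
```

A wrapper such as `scrapper()` could call helpers like these inside a `try`/`except ValueError` block and prompt again on failure, which keeps the prompting loop short.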

## Fully Wrapped in a Single Doc, with main()

In [5]:
from Bio import Entrez
from datetime import datetime as dt
import csv
import os
import re

def main():
    keyword = 'hiv'
    start_date = '2020/01/01'
    end_date = '2020/08/01'
    email = 'morrigan.mahady@uth.tmc.edu'
    path = 'outfile.txt'
    chunksize = 200
    scrapper(keyword, start_date, end_date, email, path, chunksize = chunksize)

def pubmed_search_ids(keyword, start_date, end_date, email):
    """
    Submits a query to PubMed via ENTREZ
    
    INPUTS:
        keyword (string): Keyword term
        start_date (string): Date in YYYY/MM/DD format
        end_date (string): Date in YYYY/MM/DD format
        email (string): email address, required by NCBI
    RETURNS:
        results (list): list of PMIDs as strings
        Prints error message and suggestions if error message from
        NCBI
    REQUIREMENTS/DEPENDENCIES:
        Entrez from BioPython
    """

    query = (f'({keyword}) AND ("{start_date}"[Date - Publication]'
            f' : "{end_date}"[Date - Publication])'
            )

    Entrez.email = email
    try:
        handle = Entrez.esearch(db='pubmed',
                                sort='relevance',
                                retmax = 200000,
                                retmode='xml',
                                term=query)
        results = Entrez.read(handle)
        handle.close()
        return results['IdList']
    except Exception as e:
        message = ' '.join(['An error occurred when contacting NCBI servers.',
                            'Check your query terms. Consider reattempting',
                            'outside of peak hours. Message: \n',
                            f'{e}'])
        print(message)
        
def fetch_details(target_ids, email):
    """
    Queries IDs from PubMed
    Automatically trims output to the Medline Citation

    INPUTS:
        target_ids (list): list of PMIDs as strings
        email (string): email address, required by NCBI
    RETURNS:
        results (list): list of MedlineCitation entries in each result
        Prints error message and suggestions if error message from
        NCBI
    REQUIREMENTS/DEPENDENCIES:
        Entrez from BioPython
    """
    Entrez.email = email
    try:
        handle = Entrez.efetch(db = 'pubmed',
                            retmode = 'xml',
                            id = target_ids)
                            
        results = [result['MedlineCitation'] for result in Entrez.read(handle)['PubmedArticle']]
        handle.close()
        return results
    except Exception as e:
        message = ' '.join(['An error occurred when contacting NCBI servers.',
                            'Check your query terms. Consider reattempting',
                            'outside of peak hours. Message: \n',
                            f'{e}'])
        print(message)

def format_date(date_obj):
    """
    Takes a date object and returns a formatted date time object

    INPUTS:
        date_obj (dict): Dictionary, which may contain keys
                        ['Year', 'Month', 'Day'] with single
                        string content.
                        'Year' is expected to be 4 digit, if exists
                        'Month' is expected to be 2 digit or
                                3 character string, if exists
                        'Day' is expected to be 2 digit, if exists
    RETURNS:
        date (date): Date, with missing/invalid month or day
                     rounded to '01'.
                     Returns an empty string if invalid, or 
                     no year.
    REQUIREMENTS/DEPENDENCIES:
        datetime.datetime as dt
    """
    # Dictionary to translate string months
    months_dict = {
                'jan': '01', 'feb': '02', 'mar': '03', 
                'apr': '04', 'may': '05', 'jun': '06' ,
                'jul': '07', 'aug': '08', 'sep': '09', 
                'oct': '10', 'nov': '11', 'dec': '12',
                }

    # YEAR
    if 'Year' in date_obj.keys():
        year = date_obj['Year'].lower().strip()
        if len(year) != 4 or not year.isdigit():
            year = False
    else:
        year = False
    # MONTH
    if 'Month' in date_obj.keys():
        month = date_obj['Month'].lower().strip()
        if month in months_dict.keys():
            month = months_dict[month]
        # Accept 1 or 2 digit months (1-12) and zero-pad them
        elif month.isdigit() and int(month) in range(1, 13):
            month = f'{int(month):02d}'
        else:
            month = False
    else:
        month = False
    # DAY
    if 'Day' in date_obj.keys():
        day = date_obj['Day'].lower().strip()
        # Accept 1 or 2 digit days (1-31) and zero-pad them
        if day.isdigit() and int(day) in range(1, 32):
            day = f'{int(day):02d}'
        else:
            day = False
    else:
        day = False

    # Format
    if year and month and day:
        try:
            date = dt.strptime(f'{year}/{month}/{day}', r'%Y/%m/%d').date()
        except ValueError:
            # Day does not exist in that month (e.g. Feb 30): round to '01'
            date = dt.strptime(f'{year}/{month}', r'%Y/%m').date()
    elif year and month:
        date = dt.strptime(f'{year}/{month}', r'%Y/%m').date()
    elif year and day:
        date = dt.strptime(f'{year}/{day}', r'%Y/%d').date()
    elif year:
        date = dt.strptime(f'{year}', r'%Y').date()
    else:
        date = ''

    # Return
    return date

def format_author(record, delimiter = '+'):
    """
    Formats an Author from a PubMed Author List into a string
    in 'FIRST (delimiter) LAST (delimiter) INITIALS' format,
    all in lowercase

    INPUTS:
        record (dict): Single entry from a PubMed MedlineCitation 
                already sliced to ['Article']['AuthorList'].
                Expected to potentially contain the single values
                for keys ['ForeName', 'LastName', 'Initials']
        delimiter (str): delimiter between components in the 
                output string. Default is '+' 
    RETURNS:
        name_str (str): string listing of author name
                missing fields are blank. If ForeName absent,
                attempts to replace with Initials
    """
    if 'ForeName' in record.keys() and len(record['ForeName']) > 0:
        first_nm = record['ForeName'].lower().strip()
    elif 'Initials' in record.keys() and len(record['Initials']) > 0:
        first_nm = record['Initials'].lower().strip()
    else:
        first_nm = ''
    if 'Initials' in record.keys() and len(record['Initials']) > 0:
        initials = record['Initials'].lower().strip()
    else:
        initials = ''

    if 'LastName' in record.keys() and len(record['LastName']) > 0:
        last_nm = record['LastName'].lower().strip()
    else:
        last_nm = ''

    name_str = delimiter.join([first_nm,last_nm,initials])

    return name_str

def journal_content(record, extract_date = False):
    """
    Scrapes ['Article']['Journal'] level content from
    a PubMed MedlineCitation entry:
    Journal Title, Journal ISO Abbreviation, Volume, Issue,
    and Publication Date

    INPUTS:
        record (dict): PubMed MedlineCitation already sliced to
                ['Article']['Journal']
        extract_date (bool): TRUE if extraction of date is
                desired; default is FALSE 
    RETURNS:
        jour_data (list): list of ['Journal'] data items;
                [Journal Title, Journal ISO Abbreviation,
                Volume, Issue, Publication Date]
                missing fields are empty strings
    REQUIREMENTS/DEPENDENCIES:
        format_date(): datetime.datetime as dt
    """
    # JOURNAL NAME
    if 'Title' in record.keys() and len(record['Title']) >0:
        jour_title = str(record['Title']).lower().strip()
    else:
        jour_title = ''

    # JOURNAL ABBREVIATION
    if 'ISOAbbreviation' in record.keys() and len(record['ISOAbbreviation']) >0:
        jour_abbrev = str(record['ISOAbbreviation']).lower().strip()
    else:
        jour_abbrev = ''

    # Journal Issue Items
    if 'JournalIssue' in record.keys():
    # JOURNAL ISSUE
        if 'Issue' in record['JournalIssue'].keys() and len(record['JournalIssue']['Issue']) >0:
            jour_issue = str(record['JournalIssue']['Issue']).lower().strip()
        else:
            jour_issue = ''
    # JOURNAL VOLUME
        if 'Volume' in record['JournalIssue'].keys() and len(record['JournalIssue']['Volume']) >0:
            jour_vol = str(record['JournalIssue']['Volume']).lower().strip()
        else:
            jour_vol = ''
    # JOURNAL DATE
        if extract_date:
            if 'PubDate' in record['JournalIssue'].keys() and len(record['JournalIssue']['PubDate']) >0:
                jour_date = format_date(record['JournalIssue']['PubDate'])
            else:
                jour_date = ''
        else:
            jour_date = ''
    else:
        jour_vol = ''
        jour_issue = ''
        jour_date = ''

    jour_data = [jour_title, jour_abbrev, jour_vol, jour_issue, jour_date]

    return jour_data

def article_content(record, delimiter_btwn = '+', delimiter_within = ';;', 
                    date_format = r'%Y/%m/%d'):
    """
    Scrapes ['Article'] level content from
    a PubMed MedlineCitation entry:
    Title, Abstract, Pagination, Publication Date, Language,
    Authors, Journal Title, Journal ISO Abbreviation,
    Journal Volume, Journal Issue.
    Prefers JOURNAL PUBLICATION DATE over Article Date, if
    it is present.

    INPUTS:
        record (dict): PubMed MedlineCitation already sliced to
                ['Article']
        delimiter_btwn (str): delimiter between authors in the
                author list; default is '+'
        delimiter_within (str): delimiter between components
                of author name, used in format_author();
                default is ';;'
        date_format (str): string for date format using datetime
                object. Default is '%Y/%m/%d' for YYYY/MM/DD format
    RETURNS:
        article_data (list): list of ['Article'] data items;
                [Article Title, Article Abstract, Publication Date,
                Start Page, End Page, Language, Authors, Journal Title,
                Journal ISO Abbreviation, Journal Volume, Journal Issue]
                Missing fields are empty strings
    REQUIREMENTS/DEPENDENCIES:
        datetime.datetime as dt
        format_date()
        format_author(): re
        journal_content()
    """

    # TITLE
    if 'ArticleTitle' in record.keys() and len(record['ArticleTitle']) >0:
        title = str(record['ArticleTitle']).lower().strip()
    else:
        title = ''

    # ABSTRACT
    if 'Abstract' in record.keys() and len(record['Abstract']) >0:
        if 'AbstractText' in record['Abstract'].keys() and len(record['Abstract']['AbstractText']) > 0:
            abstract = ' '.join(record['Abstract']['AbstractText'])
        else:
            abstract = ''
    else:
        abstract = ''

    # DATE (IF NOT WITH JOURNAL)
    if 'ArticleDate' in record.keys() and len(record['ArticleDate']) >0:
        date = format_date(record['ArticleDate'][0])
    else:
        date = ''

    # PAGINATION
    # The hyphen is escaped so the class matches only ' ', '-', ':' and '/'
    pgn_regex = r'^([0-9a-z]+)[ \-:/]+([0-9a-z]+)$'
    # Default to empty strings so a missing page is never written as False
    page_start = ''
    page_end = ''
    if 'Pagination' in record.keys() and len(record['Pagination']) >0:
        if 'StartPage' in record['Pagination'].keys() and len(record['Pagination']['StartPage']) > 0:
            page_start = record['Pagination']['StartPage']
        if 'EndPage' in record['Pagination'].keys() and len(record['Pagination']['EndPage']) > 0:
            page_end = record['Pagination']['EndPage']
        # Fall back to parsing MedlinePgn only if neither page was given
        if not (page_start or page_end):
            if 'MedlinePgn' in record['Pagination'].keys() and len(record['Pagination']['MedlinePgn']) > 0:
                search = re.search(pgn_regex, record['Pagination']['MedlinePgn'])
                if search:
                    page_start = search[1]
                    page_end = search[2]


    # AUTHORS
    if 'AuthorList' in record.keys() and len(record['AuthorList']) > 0:
        authors = delimiter_btwn.join([format_author(author, delimiter_within) for author in record['AuthorList']])
    else:
        authors = ''

    # LANGUAGE
    if 'Language' in record.keys() and len(record['Language']) > 0:
        lang = ', '.join(record['Language'])
    else:
        lang = ''

    # ['Journal'] CONTENT
    jour_content = journal_content(record['Journal'], extract_date = True)
    jour_title, jour_abbrev, jour_vol, jour_issue, jour_date = jour_content

    # PREFER DATE FROM JOURNAL
    if jour_date != '':
        date = jour_date
    if date != '':
        date = date.strftime(date_format)

    # Package Output

    article_content = [title, abstract, date, page_start,
                        page_end, lang, authors, jour_title, 
                        jour_abbrev, jour_vol, jour_issue
                        ]
    # Return Output
    return article_content

def scrape_record(record, name_delim_btwn = '+', name_delim_within = ';;', 
                  date_format = r'%Y/%m/%d'):
    """
    Scrapes a MedlineCitation of a PubMed article.
    Returns article data as a list, which can become a single row 
    of a dataframe or other structure.

    INPUTS:
        record (dict): PubMed MedlineCitation
        name_delim_btwn (str): delimiter between authors in the
                author list, used in article_content();
                default is '+'
        name_delim_within (str): delimiter between components
                of author name, used in format_author() within
                article_content(); default is ';;'
        date_format (str): string for date format using datetime
                object in article_content(); 
                default is '%Y/%m/%d' for YYYY/MM/DD format
    RETURNS:
        content (list): list of article data;
                [PMID, Publication Date, Article Title,
                Journal Title, Journal ISO Abbreviation, 
                Journal Volume, Journal Issue, Keywords, 
                Authors, Language, Pagination (Start),
                Pagination (End), Abstract]
                Missing fields are empty strings
    REQUIREMENTS/DEPENDENCIES:
        datetime.datetime as dt
        format_date()
        format_author(): re
        journal_content()
        article_content()
    """
    # PMID
    if 'PMID' in record.keys() and len(record['PMID']) > 0:
        pmid = str(record['PMID']).lower().strip()
    else:
        pmid = ''
    # KEYWORDS
    if 'KeywordList' in record.keys() and len(record['KeywordList']) > 0:
        keywords = name_delim_btwn.join([', '.join(keywords) for keywords in record['KeywordList']])
    else:
        keywords = ''

    # ARTICLE CONTENT
    article = [i for i in article_content(record['Article'], name_delim_btwn, name_delim_within, date_format)]
    title, abstract, date, page_start = article[0:4]
    page_end, lang, authors, jour_title = article[4:8]
    jour_abbrev, jour_vol, jour_issue = article[8:11]

    # Organize Output into row
    content = [pmid, date, title, jour_title, jour_abbrev,
            jour_vol, jour_issue, keywords, authors,
            lang, page_start, page_end, abstract]
    
    # Return
    return content

def chunk_process(keyword, start_date, end_date, email, 
                  path, chunksize = 100, date_format = r'%Y/%m/%d',
                  name_delim_btwn = '+', name_delim_within = ';;', 
                  delim = ',', quote_str = '|', overwrite = False):
    """
    Performs a search of PubMed using a date range and keyword,
    and scrapes these records into a CSV file.

    INPUTS:
        keyword (string): Keyword term
        start_date (string): Date in YYYY/MM/DD format
        end_date (string): Date in YYYY/MM/DD format
        email (string): email address, required by NCBI
        path (string,path): string address of output file path
        chunksize (int): number of records to batch process
                between opening file, used to optimize run time
                based on capabilities of an individual machine;
                default is 100
        date_format (str): string for date format using datetime
                object in article_content(); 
                default is '%Y/%m/%d' for YYYY/MM/DD format
        name_delim_btwn (str): delimiter between authors in the
                author list;
                default is '+'
        name_delim_within (str): delimiter between components
                of author name;
                default is ';;'
        delim (str): delimiter between values in a single row
                in the CSV output file. Must be a single
                character string;
                default is ','
        quote_str (str): character used to quote entries that
                contain characters in the delimiters;
                default is '|'
        overwrite (bool): True/False value indicating if it is
                desired to overwrite the output file, if it
                already exists;
                default is FALSE

    RETURNS:
        writes records to `path`. Prints happy message summary
        if successful. If error, prints error message and the row
        where the problem occurred.

    REQUIREMENTS/DEPENDENCIES:
        os
        csv
        re
        Entrez from BioPython
        datetime.datetime as dt
        pubmed_search_ids()
        scrape_record()
        journal_content()
        article_content()
        format_date()
        format_author()
    """  
    # Perform Search, Get IDS
    target_ids = pubmed_search_ids(keyword, start_date, end_date, email)
    # Chunk Data
    if target_ids:
        row = None  # last row handled; reported if an error occurs
        # Decide once whether to create/overwrite the file. Every chunk after
        # the first is appended, so overwrite = True no longer discards
        # earlier chunks.
        mode = 'w' if (overwrite or not os.path.isfile(path)) else 'a'
        try:
            for i in range (0, len(target_ids), chunksize):
                chunk_ids = target_ids[i:i+chunksize]
                records = fetch_details(chunk_ids, email)
                data = [scrape_record(record,name_delim_btwn, name_delim_within, date_format) for record in records]
                # newline = '' lets the csv module control line endings
                with open(path, mode, encoding = 'utf-8', newline = '') as outfile:
                    csvwriter = csv.writer(outfile, delimiter = delim, quotechar = quote_str)
                    for row in data:
                        csvwriter.writerow(row)
                mode = 'a'
            # Happy Message
            message = [f'Success! \n {len(target_ids)} records for ',
                    'PubMed Search: \n',
                    f'({keyword}) AND ("{start_date}"[Date - Publication]'
                    f' : "{end_date}"[Date - Publication]) \n',
                    f' written to {path}']
            print(''.join(message))
        except Exception as e:
            # str.join() only accepts strings, so convert the exception first
            message = ['An error occurred:', str(e),
                    'Row with issue:', f'{row}']
            print('\n'.join(message))

def scrapper(keyword = None, start_date = None, end_date = None,
                    email = None, path = None, chunksize = 100,
                    date_format = r'%Y/%m/%d',
                    name_delim_btwn = '+', name_delim_within = ';;', 
                    delim = ',', quote_str = '|', overwrite = False):
    """
    Validates inputs and runs scrapper!

    INPUTS:
        PROMPTS FOR INPUT IF MISSING OR INVALID:
            keyword (string): Keyword term
            start_date (string): Date in YYYY/MM/DD format
            end_date (string): Date in YYYY/MM/DD format
            email (string): email address, required by NCBI
            path (string,path): string address of output file path
        DOES NOT PROMPT OR VALIDATE, HAS DEFAULT:
            chunksize (int): number of records to batch process
                    between opening file, used to optimize run time
                    based on capabilities of an individual machine;
                    default is 100
            date_format (str): string for date format using datetime
                    object in article_content(); 
                    default is '%Y/%m/%d' for YYYY/MM/DD format
            name_delim_btwn (str): delimiter between authors in the
                    author list;
                    default is '+'
            name_delim_within (str): delimiter between components
                    of author name;
                    default is ';;'
            delim (str): delimiter between values in a single row
                    in the CSV output file. Must be a single
                    character string;
                    default is ','
            quote_str (str): character used to quote entries that
                    contain characters in the delimiters;
                    default is '|'
            overwrite (bool): True/False value indicating if it is
                    desired to overwrite the output file, if it
                    already exists;
                    default is FALSE
    RETURNS:
        writes records to `path`. Prints happy message summary
        if successful. If error, prints error message and the row
        where the problem occurred.

    REQUIREMENTS/DEPENDENCIES:
        os
        csv
        re
        Entrez from BioPython
        datetime from datetime as dt
        chunk_process()
        pubmed_search_ids()
        scrape_record()
        journal_content()
        article_content()
        format_date()
        format_author()
    """
    ########### DATA CHECKS ################
    # =======================================      
    email_regex = r"^[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$"
    # Allow user to exit
    quit = False
    while not quit:
        ###### GET INPUTS IF NOT GIVEN IN FUNCTION
        # Each prompt uses `if` (not an inner `while`) so that `break`
        # leaves the outer loop as soon as the user enters 'q'. Empty or
        # invalid answers are prompted for again on the next pass.
        if not email:
            email = input("Enter email for Entrez query, or 'q' to exit: ").lower().strip()
            if email == 'q':
                quit = True
                break
        if not keyword:
            keyword = input("Enter keyword, or 'q' to exit: ").lower().strip()
            if keyword == 'q':
                quit = True
                break
        if not start_date:
            start_date = input("Enter start date (YYYY/MM/DD), or 'q' to exit: ").strip().lower()
            if start_date == 'q':
                quit = True
                break
        if not end_date:
            end_date = input("Enter end date (YYYY/MM/DD), or 'q' to exit: ").strip().lower()
            if end_date == 'q':
                quit = True
                break
        if not path:
            # The stored path is not lowercased because paths can be case sensitive
            path = input("Enter path for output file (.txt or .csv), or 'q' to exit: ").strip()
            # `path == 'q' or 'Q'` is always True, so compare a lowercased copy
            if path.lower() == 'q':
                quit = True
                break
        ###### VALIDATE
        if email:
            if not re.search(email_regex, email):
                print("Invalid email.")
                email = None
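        # Only strings need parsing; a date object left over from an
        # earlier pass of the loop has already been validated
        if start_date and isinstance(start_date, str):
            # Checks start_date format, if this fails, it's invalid
            try:
                start_date = dt.strptime(start_date, r'%Y/%m/%d').date()
            except ValueError:
                print("Invalid start date.")
                start_date = None
        if end_date and isinstance(end_date, str):
            # Checks end_date format, if this fails, it's invalid
            try:
                end_date = dt.strptime(end_date, r'%Y/%m/%d').date()
            except ValueError:
                print("Invalid end date.")
                end_date = None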
        # Ensure start_date is before end_date
        if (start_date and end_date) and start_date > end_date:
            print(f"Start Date {start_date} is after {end_date}. Please enter valid dates")
            start_date = None
            end_date = None
        # Check path is valid
        if path:
            try:
                path = os.path.abspath(path)
                if not os.path.isdir(os.path.dirname(path)):
                    print("Path directory not found.")
                    path = None
                # splitext() copes with file names that have several dots
                # or no extension at all
                elif os.path.splitext(path)[1].lower() not in ('.txt', '.csv'):
                    print("File must have a .txt or .csv extension.")
                    path = None
            except (TypeError, ValueError, OSError):
                print("Invalid path")
                path = None
        # Ensure String Format for dates
        if start_date and end_date:
            if not isinstance(start_date,str):
                start_date = start_date.strftime(r'%Y/%m/%d')
            if not isinstance(end_date, str):
                end_date = end_date.strftime(r'%Y/%m/%d')
        
        # If all valid, break out of validation cycle!
        if start_date and end_date and keyword and email and path:
            break
    if not quit:
        chunk_process(keyword, start_date, end_date, email, 
                  path, chunksize, date_format,
                  name_delim_btwn, name_delim_within, 
                  delim, quote_str, overwrite)
    else:
        print("Processing Stopped.")

if __name__ == "__main__":
    main()

An error occured when contacting NCBI servers. Check your query terms. Consider reattempting outside of peak hours. Message: 
 HTTP Error 500: Internal Server Error
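
An HTTP 500 like the one above is often transient, so one option is to retry the request a few times with a growing pause before giving up. The sketch below is only an illustration under that assumption: `with_retries()` is a hypothetical helper, not part of Biopython or of this notebook, and it retries on any exception rather than inspecting the HTTP status.

```python
import time


def with_retries(func, *args, tries=3, base_delay=1.0, sleep=time.sleep,
                 **kwargs):
    """
    Call func(*args, **kwargs); on failure wait and try again.
    The wait doubles after each failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised so the caller can still report it.
    """
    for attempt in range(1, tries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as err:
            if attempt == tries:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f'Attempt {attempt} failed ({err}); retrying in {delay:g}s')
            sleep(delay)


if __name__ == '__main__':
    # Stand-in for a flaky network call: fails twice, then succeeds.
    state = {'calls': 0}

    def flaky_request():
        state['calls'] += 1
        if state['calls'] < 3:
            raise RuntimeError('HTTP Error 500: Internal Server Error')
        return ['12345', '67890']  # made-up IDs

    # A no-op sleep keeps the demo fast.
    print(with_retries(flaky_request, tries=3, sleep=lambda seconds: None))
```

To use something like this with the script, the raw call would have to be wrapped, for example `with_retries(Entrez.esearch, db='pubmed', term=query)`, because `pubmed_search_ids()` catches exceptions itself and returns nothing on failure.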
