## Purpose.

Here we collect abstracts and titles from [*Strategic Management Journal*](https://onlinelibrary.wiley.com/journal/10970266), flagship publication of the [Strategic Management Society](https://www.strategicmanagement.net/).  This information is to be used as part of an attempt to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without paywall login. But we have to do this in ***three stages***.  

⓵ **Get the URL for the each volume**. We start with the journal's hope page.  Our essential information is embedded in a frame depicted below on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual isses. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts are accessible from individual articles. We have to get the abstracts from these individual-article pages.






In [1]:
# Import libaries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import itertools as it
import json, time, io, os, shutil, datetime, re, sys, codecs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys


from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

In [2]:
# We use lots of list-comprehension, which drives requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this by a get_sleep function.  This function 
# executes a req.get operation, returning the result, with a one-second
# delay.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 0.75,
                                 high = 2.75))
    return(req.get(url,
                   headers = headers))
#
# We also require a Selenium-based version of sleep_get.  This differs
# from the Requests-based version above, in that a separate step is 
# required for the webdriver to navigate to the specified url.
def selenium_sleep_get(url):
    time.sleep(np.random.uniform(low = 1.25,
                                 high = 3.75))
    driver.get(url)
    return BeautifulSoup(driver.page_source,
                         'lxml')


#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
    partitions = [np.array(object = range(partition_len)) + part * partition_len
                     for part in range(partition_counts)]
    partitions[-1] = np.arange(start = partitions[-1][0],
                               stop = parted_list.shape[0])
    return dict(enumerate([list(parted_list[part])
                             for part in partitions]))


In [3]:
sms_url = 'https://www.strategicmanagement.net/'
smj_url = 'https://onlinelibrary.wiley.com'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
smj_html = sleep_get(smj_url + '/loi/10970266',
                   headers = headers)
smj_soup = BeautifulSoup(smj_html.content, 'lxml')
smj_login = json.load(open('/Users/nahamlet/Documents/Uncertainty Research/society logins/smj_logon.JSON'))

In [4]:
# ⓪ Initiate websession with society website in order to log in.
options = webdriver.ChromeOptions()
preferences = {'download.default_directory' : '../data/',
               'download.prompt_for_download' : False,
               'download.directory_upgrade' : True,
               "plugins.always_open_pdf_externally": True,
               'download.extensions_to_open' : 'applications/pdf',
               'plugins.plugins_list' : [{'enabled' : False,
                                          'name' : 'Chrome PDF Viewer'}]}
options.add_experimental_option('prefs', preferences)
#  
driver = webdriver.Chrome(chrome_options = options)



  if sys.path[0] == '':


### 🛑 Sign into society website 🛑

In [5]:
driver.get(sms_url)
time.sleep(np.random.uniform(low = 4.5,
                             high = 6))
driver.find_element_by_link_text('Log In').click()
time.sleep(np.random.uniform(low = 2.5,
                             high = 5))
driver.find_element_by_id('username').send_keys(smj_login.get('username'))
driver.find_element_by_id('password').send_keys(smj_login.get('password'))
driver.find_element_by_id('loginSubmit').click()

# ⓵ Capture the volumes and their urls. 
#    ⓐ Navigate to the journal's homepage.  We need to do this in three steps
#       after logging in.  Shortcuts appear to break session integrity.  
#       Get the entire page source as a BeautifulSoup object.
driver.get('https://www.strategicmanagement.net/smj/overview/overview')
time.sleep(np.random.uniform(low = 1.25,
                             high = 2.5))
driver.get('https://www.strategicmanagement.net/wiley/smj')
time.sleep(np.random.uniform(low = 1.25,
                             high = 2.5))
driver.get('https://onlinelibrary.wiley.com/loi/10970266')
smj_soup = BeautifulSoup(driver.page_source,
                         'lxml')

In [6]:
#    ⓑ Instantiate an empty dictionary within which to capture the volumes.
smj_volumes = dict()
#
#    ⓒ Add current-decade volumes to the dictionary.
smj_volumes.update(
        {vol.attrs.get('title') : {'href' : vol.attrs.get('href')}
            for vol in smj_soup.find('div', {'class' : 'loi--aside__left'})\
                               .find('li', {'class' : 'active'})} )
#
#    ⓓ Prior-decade volumes are embedded in nested lists.  We get these
#       through a two-level dictionary comprehension.  Add the resulting
#       dictionary to our smj_volumes dictionary using the dictionary.update()
#       method.
smj_volumes.update(
        {vol.find('a').attrs.get('title') : {'href' : vol.find('a').attrs.get('href')}
             for decade in smj_soup.find('ul', {'class' : 'rlist loi__list'})\
                                   .find_all('li', {'class' : 'nested'})
             for vol in decade.find_all('li')}  )

In [7]:
# Only perform ⓶ and ⓷ if not previously done, as evidenced by
# absence of JSON file in specified directory.  Otherwise, read in the JSON file.
if os.path.isfile('../data/stratMgtJrnlFullText.json'):
    smj_volumes = json.load(open('../data/stratMgtJrnlFullText.json'))
else:
    # ⓶ Cycle through the volumes, now, and add issue information for each.
    for vol in smj_volumes.keys():
        vol_soup = selenium_sleep_get(smj_url + smj_volumes.get(vol).get('href'))
        smj_volumes.get(vol).update(
            {'issues' : {issue.find('h4').text : {'href' : issue.find('h4').find('a').attrs.get('href'),
                                                  'coverDate' : issue.find('div', {'class' : 'coverDate'}).text}
                                                    for issue in vol_soup.find('ul', {'class' : 'rlist loi__issues'})\
                                                                     .find_all('li', {'class' : 'card clearfix'})}})
    # ⓷ Scan through all issues for all volumes.  
    #    ⓐ For each issue, assemble an  issue table of contents, with an entry for 
    #       each article.  The entry includes a title, list of authors, article url, 
    #       and a url for the pdf.
    for (vol, vol_summy) in smj_volumes.items():
        for (issue, issue_summy) in vol_summy.get('issues').items():
            print((vol, issue))
            issue_soup = selenium_sleep_get(smj_url + smj_volumes.get(vol).get('issues').get(issue).get('href'))
            issue_summy.update(
                {'issue_toc' : 
                 dict(enumerate([{'title' : item.find('a', {'class' : 'issue-item__title visitable'})\
                                                 .find('h2')\
                                                 .text\
                                                 .lower(),
                                 'authors' : [author.text\
                                                    .replace('\n ', '')\
                                                    .lower()
                                                 for author in item.find_all('span', {'class' : 'author-style'})],
                                   'href' : item.find('a', {'class' : 'issue-item__title visitable'}).attrs.get('href'),
                                   'pdf_url' : item.find('a', {'title' : 'EPDF'}).attrs.get('href')}
                                    for container in issue_soup.find('div', {'class' :  'table-of-content'})\
                                                               .find('div', {'class' : 'table-of-content'})\
                                                               .find_all('div', {'class' : 'card issue-items-container exportCitationWrapper'})
                                    for item in container.find_all('div', {'class', 'issue-item'})
                                    if item.find('a').find('h2').text.lower() not in ['erratum',
                                                                                      'masthead',
                                                                                      'announcement',
                                                                                      'issue information']]))
                }  
            )
#
# ⓸ Export the dictionary produced during ⓷ to a JSON file on local storage.
#    ⓑ Export the dictionary produced during ⓷ to a JSON file on local storage.
with io.open('../data/stratMgtJrnlFullText.json', 'w', encoding = 'utf-8') as f:
    json.dump(smj_volumes, 
              f, 
              ensure_ascii = False, 
              indent = 4)

In [8]:
corpus_idx = dict(enumerate([(vol, issue, article)
                                for (vol, vol_issues) in smj_volumes.items()
                                for (issue, issue_articles) in vol_issues.get('issues').items()
                                for article in issue_articles.get('issue_toc').keys()]))
# corpus_idx.update({art_key : {'art_tuple' : art_tuple,
#                                'collected' : False}
#                     for (art_key, art_tuple) in corpus_idx.items()})
# with io.open('../data/strat_mgt_j_idx.json', 'w', encoding = 'utf-8') as f:
#     json.dump(corpus_idx, 
#               f, 
#               ensure_ascii = False, 
#               indent = 4)

In [9]:
# ⓹ Acquire the full-text article. Most volumnes have the text in pdf format.
#    We use that as our approach, therefore.  
#    ⓐ Clean all pdfs out of the download directory so that we only have to
#       worry about one pdf.
def download_pdf_article(url):
    download_dir = '/Users/nahamlet/Downloads'
    os.chdir('/Users/nahamlet/Documents/GitHub/uncertainty-research/BizStratTopicAnalysis')
    pdf_files = [file 
                 for file in os.listdir(download_dir)
                 if 'pdf' in file]
    for file in pdf_files:
        os.remove(os.path.join(download_dir, file))
    #
    #    ⓑ Download the pdf-format full-text article. We use Selenium for this purpose.
    #       This must be done in two steps, because the actual file-download trigger
    #       is buried in a drop-down menu. We must go through this in order to get 
    #       the actual download. Also, we "drill" down to the actual triggers
    #       in multiple steps, in order to avoid ambiguiities. We introduce a pause
    #       in order for the session state to catch up.
    driver.get(url)
    time.sleep(np.random.uniform(low = 5,
                                 high = 7.5))
    driver.find_element_by_css_selector('div.navbar-download.dropdown-widget')\
          .find_element_by_css_selector('button.dropdown-trigger.btn').click()
    time.sleep(np.random.uniform(low = 2.5,
                                 high = 3.5))
    driver.find_element_by_css_selector('div.navbar-download.dropdown-widget')\
          .find_element_by_css_selector('div.row.flex-container')\
          .find_element_by_css_selector('span.icon.material-icons').click()
    time.sleep(np.random.uniform(low = 2.5,
                                 high = 3.5))
    driver.find_element_by_css_selector('div.navbar-download.dropdown-widget')\
          .find_element_by_css_selector('button.dropdown-trigger.btn').click()
    #
    #    ⓒ Move the journal article to the data directory.
    if os.path.isfile('./data/journal_article.pdf'):
        os.remove('./data/journal_article.pdf')
    if os.path.isfile('./data/journal_article.txt'):
        os.remove('./data/journal_article.txt')
    time.sleep(np.random.uniform(low = 15,
                                 high = 25))
    pdf_files = [file 
                 for file in os.listdir(download_dir)
                 if 'pdf' in file][0]
    shutil.move(os.path.join(download_dir, pdf_files),
                './data/journal_article.pdf')
    pass



### <font color='#0051ba'> Reference Notes: </color>

#### <font color = '#2767ff'> Extracting pdf into text.</color>

This Stack overflow [Extracting text from a PDF file using PDFMiner in python?](https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python) demonstrates the extraction of a multi-column pdf article into a coherent text string. The essential point here is that the `pdfminer` package seems to be the one for us.  We get our preferred pattern for extracting pdf into text from the [pdfminer.six](https://pdfminersix.readthedocs.io/en/latest/tutorial/composable.html) documentation.

#### <font color = '#2767ff'>Detecting parenthetically-delimited substrings.</color>

This stack overflow [Python regex: matching a parenthesis within parenthesis](https://stackoverflow.com/questions/5357460/python-regex-matching-a-parenthesis-within-parenthesis) contains patterns for finding parentheses pairs. Most of these will be convenient to remove reference citations from our full-text articles.  The second, containing the pattern

    `re.findall(r'\([^()]*\)', jrnl_article_raw_concat)`

appears particularly convenient.



In [10]:
def pdf_article_to_string():
    #    ⓓ Extract pdf into text string.
    #       ⅰ. We first use pdfminer to extract the article pdf into a string.
    os.chdir('/Users/nahamlet/Documents/GitHub/uncertainty-research/BizStratTopicAnalysis')
    output_string = StringIO()
    with open('./data/journal_article.pdf', 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
    article_string = output_string.getvalue()
    #
    #       ⅱ. This string is quite "dirty", with lots of editorial markings and other
    #          junk.  We want to get rid of this. We first try to segment the document
    #          using section demarcations as proxies for labels.  We take only the
    #          substring between the 'INTRODUCTION' and the 'ACKNOWLEDGMENTS' 
    #          section headers.  If these are not present, we use alternative, 
    #          noisier substrings. We also check if the APPENDIX follows the
    #          REFERENCES, in which case we concatenate the appendix onto our string.
    if re.search('REFERENCES', article_string) is not None:
        refs = re.search('REFERENCES', article_string)
    else:
        refs = None
    if re.search('APPEND', article_string) is not None:
        apdx = re.search('APPEND', article_string)
    else:
        apdx = None
    if re.search('ACKNOWLEDGEMENTS', article_string) is not None:
        ack = re.search('ACKNOWLEDGEMENTS', article_string)
    else:
        ack = None
    if re.search('INTRODUCTION', article_string) is not None:
        intro = re.search('INTRODUCTION', article_string)
    else:
        intro = None
    article_start = intro.end() if intro is not None else 0
    article_end = ((-1 if refs is None
                       else refs.start()) if ack is None
                                          else ack.start())
    article_string_body = article_string[article_start:article_end]
    if (apdx is not None) and (apdx.start() > refs.end()):
        article_string_body = article_string_body + article_string[apdx.start():]
    #
    #       ⅲ. We now clean up our string.  We first split the string into 
    #          a list of substrings, and filter out substrings containing 
    #          editorial markers. We then rejoin our list and remove unwanted
    #          characters from the reassembled string.
    art_string_body_clean = ' '.join([article_line.replace('\n', ' ')
                                                 .lower()
                                        for article_line in article_string_body.split('\n')
                                        if (('Strategic Management Journal' not in article_line)
                                                and ('Strat. Mgmt. J.' not in  article_line)
                                                and ('Wiley' not in article_line)
                                                and ('Final revision' not in article_line)
                                                and ('\n' != article_line)
                                                and ('\x0c' not in article_line)
                                                and ('DOI:' not in article_line)
                                                and (not bool(re.match('\d+\n', article_line)))
                                                and (not bool(re.match('\d+.\d+\n', article_line)))
                                                and (not bool(re.match('−\d+.\d+\n', article_line)))
                                                and (len(article_line.replace('\n', '')) > 2)
                                                and ('∗' not in article_line)
                                                and ('†' not in article_line)
                                                and ('|' not in article_line)
                                       )]).replace('- ', '')\
                                          .replace('“', '')\
                                          .replace('”', '')\
                                          .replace('.', '')\
                                          .replace(',', '')\
                                          .replace('…', '')\
                                          .replace('*', '')\
                                          .replace("\'", '')\
                                          .replace('/', '')\
                                          .replace('⁄⁄', '')\
                                          .replace('<', '')\
                                          .replace('>', '')\
                                          .replace('~', '')\
                                          .replace(':', '')\
                                          .replace('--', '')\
                                          .replace('[', '')\
                                          .replace(']', '')\
                                          .replace('?', '')\
                                          .replace("'", '')\
                                          .replace("‘", '')\
                                          .replace("’", '')\
                                          .replace('copyright', '')\
                                          .replace('Strategic Management Journal', '')\
                                          .replace('strategic management journal', '')
    art_string_body_clean = re.sub('\d',     # ⇚ Remove Numbers
                                   '', 
                                   re.sub(r'\([^()]*\)', 
                                          '', 
                                          art_string_body_clean))
    art_string_body_clean = ' '.join([token   # ⇚ Remove multiple spaces
                                        for token in art_string_body_clean.split(' ')
                                        if len(token) > 1])
    return {'full_text' : art_string_body_clean,
            'time_collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ'),
            'word_count' : len(art_string_body_clean.split(' '))}


In [11]:
def add_full_text_to_json(art_key):
    #    ⓔ Add full-text string to our json.
    #       ⅰ. Extract the randomly selected article from the 
    #          corpus JSON file.
    os.chdir('/Users/nahamlet/Documents/GitHub/uncertainty-research/BizStratTopicAnalysis')
    smj_volumes = json.load(open('./data/stratMgtJrnlFullText.json'))
    (vol, issue, article_idx) = corpus_idx.get(art_key).get('art_tuple')
    article = smj_volumes.get(vol).get('issues').get(issue).get('issue_toc').get(article_idx)
    print(((vol, issue, article_idx),
           datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    #
    #       ⅱ. Invoke locally-defined program `download_pdf_article` to download
    #          the randomly-selected article to local storage.
    download_pdf_article(article.get('pdf_url'))
    #
    #       ⅲ. Invoke locally-defined program `pdf_article_to_string` convert
    #           the pdf fiel to a "clean" full-text string. Append to the 
    #           corresponding JSON branch.
    article.update(pdf_article_to_string())
    #
    #       ⅳ. Update our corpus index JSON to change the state of our 
    #          randomly-selected article to `collected`. Also record the
    #          time collected.
    corpus_idx.get(art_key).update({'collected' : True,
                                    'time_collected' : article.get('time_collected')})
    #
    #       ⅴ. Save our corpus-index and corpus JSON to local storage.
    with io.open('./data/stratMgtJrnlFullText.json', 'w', encoding = 'utf-8') as f:
        json.dump(smj_volumes, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)
    #
    #       ⅵ. Write a small JSON recording the last article collected and
    #          the timestamp of its collection.
    with io.open('./data/strat_mgt_j_time_check.json', 'w', encoding = 'utf-8') as f:
        json.dump({'article_tuple' : corpus_idx.get(art_key).get('art_tuple'),
                   'time_collected' : article.get('time_collected')}, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)
    #
    #       ⅶ. Push all of the changed objects up to a GitHub repository.
#     ! git add .
#     ! git commit -m 'strat mgt j full-text articles'
#     ! git push origin master
    pass


In [12]:
os.chdir('/Users/nahamlet/Documents/GitHub/uncertainty-research/BizStratTopicAnalysis')
#        ⓕ Scan through the corpus index `corpus_idx` and extract full-text articles.
#           We employ a while loop to identify control and find a stopping point.
#           For each article we save our results to local storage and upload changes
#           to GitHub. This way we continuously save corpus-acquisition state
#           and are able to pick up where we left off should our session be
#           interrupted.
#           ⅰ. Load the `corpus_idx` JSON from local storage.
corpus_idx = json.load(open('./data/strat_mgt_j_idx.json'))
#
#           ⅱ. Filter the `corpus_idx` dictionary to obtain only those
#              articles for which the `collected` status is False.
corp_to_do = {idx_key : corp_item
                for (idx_key, corp_item) in corpus_idx.items()
                if not corp_item.get('collected')}
#
#           ⅲ. Randomly select one article from the `corp_to_do` dictionary.
art_key = np.random.choice(list(corp_to_do.keys())).item(0)
#
#           ⅳ. Invoke the `add_full_text_to_json` locally-defined function
#              to add the full-text article to the JSON.  This
#              also changes the state for our article in the `corpus_idx`
#              dictionary to `collected` as True.
add_full_text_to_json(art_key)
#
#           ⅴ. Write the updated `corpus_idx` JSON to local storage.
with io.open('./data/strat_mgt_j_idx.json', 'w', encoding = 'utf-8') as f:
    json.dump(corpus_idx, 
              f, 
              ensure_ascii = False, 
              indent = 4)
#
#           ⅵ. We now cycle through `corpus_idx` until all articles are collected.
control_counter = 0
while len(corp_to_do) > 0:
    control_counter += 1
    time.sleep(np.random.uniform(low = 7.5,                      # ⇚ Random sleep interval.
                                 high = 15))
    corpus_idx = json.load(open('./data/strat_mgt_j_idx.json'))  # ⇚ ⅰ. Load `corpus_idx`
    corp_to_do = {idx_key : corp_item                     
                    for (idx_key, corp_item) in corpus_idx.items()
                    if not corp_item.get('collected')}           # ⇚ ⅱ. Filter `corpus_idx`
    art_key = np.random.choice(list(corp_to_do.keys())).item(0)  # ⇚ ⅲ. Random article
    try: 
        add_full_text_to_json(art_key)                           # ⇚ ⅵ. Article text to JSON
    except:
        print('failed ' + datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ'))
        driver.get('https://onlinelibrary.wiley.com/loi/10970266') # ⇚ Safe page if download fails
    with io.open('./data/strat_mgt_j_idx.json', 'w', encoding = 'utf-8') as f:
        json.dump(corpus_idx, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)                                    # ⇚ ⅴ. Save `corpus_idx`
    with io.open('./data/strat_mgt_j_articles_done.json', 'w', encoding = 'utf-8') as f:
        json.dump({idx_key : corp_item                     
                    for (idx_key, corp_item) in corpus_idx.items()
                    if corp_item.get('collected')}, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)                                    # ⇚ ⅴ. Save `corpus_idx`
#     if control_counter >= 5:
#         break

(('1987 - Volume 8', 'Volume 8, Issue 2', '6'), '20-06-12, 030323Z')


AttributeError: 'NoneType' object has no attribute 'end'

In [15]:
corpus_idx = json.load(open('./data/strat_mgt_j_idx.json'))

{idx : art_profile
    for (idx, art_profile) in corpus_idx.items()
    if not art_profile.get('collected')}

{'2116': {'art_tuple': ['1997 - Volume 18', 'Volume 18, Issue 9', '4'],
  'collected': False},
 '2132': {'art_tuple': ['1997 - Volume 18', 'Volume 18, Issue S1', '5'],
  'collected': False},
 '2147': {'art_tuple': ['1997 - Volume 18', 'Volume 18, Issue 4', '0'],
  'collected': False},
 '2333': {'art_tuple': ['1994 - Volume 15', 'Volume 15, Issue S2', '3'],
  'collected': False},
 '2566': {'art_tuple': ['1990 - Volume 11', 'Volume 11, Issue 6', '2'],
  'collected': False},
 '2618': {'art_tuple': ['1989 - Volume 10', 'Volume 10, Issue 4', '6'],
  'collected': False},
 '2749': {'art_tuple': ['1987 - Volume 8', 'Volume 8, Issue 2', '6'],
  'collected': False},
 '2762': {'art_tuple': ['1986 - Volume 7', 'Volume 7, Issue 6', '2'],
  'collected': False},
 '2964': {'art_tuple': ['1981 - Volume 2', 'Volume 2, Issue 2', '5'],
  'collected': False},
 '2989': {'art_tuple': ['1980 - Volume 1', 'Volume 1, Issue 3', '4'],
  'collected': False},
 '3003': {'art_tuple': ['1980 - Volume 1', 'Volume 1, Is

In [16]:
len({idx : art_profile
    for (idx, art_profile) in corpus_idx.items()
    if not art_profile.get('collected')})

11

In [17]:
smj_volumes.keys()

dict_keys(['2020 - Volume 41', '2019 - Volume 40', '2018 - Volume 39', '2017 - Volume 38', '2016 - Volume 37', '2015 - Volume 36', '2014 - Volume 35', '2013 - Volume 34', '2012 - Volume 33', '2011 - Volume 32', '2010 - Volume 31', '2009 - Volume 30', '2008 - Volume 29', '2007 - Volume 28', '2006 - Volume 27', '2005 - Volume 26', '2004 - Volume 25', '2003 - Volume 24', '2002 - Volume 23', '2001 - Volume 22', '2000 - Volume 21', '1999 - Volume 20', '1998 - Volume 19', '1997 - Volume 18', '1996 - Volume 17', '1995 - Volume 16', '1994 - Volume 15', '1993 - Volume 14', '1992 - Volume 13', '1991 - Volume 12', '1990 - Volume 11', '1989 - Volume 10', '1988 - Volume 9', '1987 - Volume 8', '1986 - Volume 7', '1985 - Volume 6', '1984 - Volume 5', '1983 - Volume 4', '1982 - Volume 3', '1981 - Volume 2', '1980 - Volume 1'])

In [28]:
art_idx = np.random.choice(a = list(corpus_idx.keys()),
                           size = 1).item(0)
(vol, issue, art) =  tuple(corpus_idx.get(art_idx).get('art_tuple'))

In [29]:
smj_volumes.get(vol).get('issues').get(issue).get('issue_toc').get(art).get('word_count')

4018

In [39]:
corpus_word_count = dict()
for art_idx in corpus_idx.keys():
    (vol, issue, art) =  tuple(corpus_idx.get(art_idx).get('art_tuple'))
    if smj_volumes.get(vol).get('issues').get(issue).get('issue_toc').get(art).get('word_count') is not None:
        corpus_word_count.update({art_idx : smj_volumes.get(vol).get('issues').get(issue).get('issue_toc').get(art).get('word_count')})

In [40]:
sum(list(corpus_word_count.values()))

21367368

In [31]:
{art_idx : smj_volumes.get(vol_iss_art[0]).get('issues').get(vol_iss_art[1]).get('issue_toc').get(vol_iss_art[2]).get('word_count')
    for art_idx in np.random.choice(a = list(corpus_idx.keys()),
                                    size = 5)
    for vol_iss_art in tuple(corpus_idx.get(art_idx).get('art_tuple'))
    if }

AttributeError: 'NoneType' object has no attribute 'get'

🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑

### Here, we try to reproduce [[Moung, 2019]](https://medium.com/@moungpeter/how-to-automate-downloading-files-using-python-selenium-and-headless-chrome-9014f0cdd196).

🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑⛔️❌🚫❌⛔️🛑

### Work below is Requests-based.  Only gets titles, abstracts