## Purpose.

Here we collect abstracts and titles from [*Strategic Management Journal*](https://onlinelibrary.wiley.com/journal/10970266), flagship publication of the [Strategic Management Society](https://www.strategicmanagement.net/).  This information is to be used as part of an attempt to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without paywall login. But we have to do this in ***three stages***.  

⓵ **Get the URL for the each volume**. We start with the journal's hope page.  Our essential information is embedded in a frame depicted below on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual isses. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts are accessible from individual articles. We have to get the abstracts from these individual-article pages.






In [1]:
# Import libaries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime

In [2]:
# We use lots of list-comprehension, which drives requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this by a get_sleep function.  This function 
# executes a req.get operation, returning the result, with a one-second
# delay.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 0.75,
                                 high = 2.75))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
    partitions = [np.array(object = range(partition_len)) + part * partition_len
                     for part in range(partition_counts)]
    partitions[-1] = np.arange(start = partitions[-1][0],
                               stop = parted_list.shape[0])
    return dict(enumerate([list(parted_list[part])
                             for part in partitions]))


In [3]:
smj_url = 'https://onlinelibrary.wiley.com'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
smj_html = sleep_get(smj_url + '/loi/10970266',
                   headers = headers)
smj_soup = BeautifulSoup(smj_html.content, 'lxml')

In [4]:
smj_url + '/loi/10970266'

'https://onlinelibrary.wiley.com/loi/10970266'

In [5]:
# ⓵ Obtain the URLs for each volume. These located on the journal's homepage.
#    collect this in a dictionary labeled `smj_volumes`.  This is a simple 
#    nested dictionary of {'volume label ' : {'href' : vol_url}}.
#    The vol_url attribute is appended to our smj url to get the
#    URL for the issue table of contents.
smj_volumes = {issues.find('a').attrs.get('title') : {'href' : issues.find('a').attrs.get('href')}
                                                     for issues in smj_soup.find('div', {'class' : "loi--aside__left"})\
                                                                           .find_all('li')
                                                     if issues.find('a').attrs.get('href') != '#' }



In [6]:
smj_volumes

{'2020 - Volume 41': {'href': '/loi/10970266/year/2020'},
 '2019 - Volume 40': {'href': '/loi/10970266/year/2019'},
 '2018 - Volume 39': {'href': '/loi/10970266/year/2018'},
 '2017 - Volume 38': {'href': '/loi/10970266/year/2017'},
 '2016 - Volume 37': {'href': '/loi/10970266/year/2016'},
 '2015 - Volume 36': {'href': '/loi/10970266/year/2015'},
 '2014 - Volume 35': {'href': '/loi/10970266/year/2014'},
 '2013 - Volume 34': {'href': '/loi/10970266/year/2013'},
 '2012 - Volume 33': {'href': '/loi/10970266/year/2012'},
 '2011 - Volume 32': {'href': '/loi/10970266/year/2011'},
 '2010 - Volume 31': {'href': '/loi/10970266/year/2010'},
 '2009 - Volume 30': {'href': '/loi/10970266/year/2009'},
 '2008 - Volume 29': {'href': '/loi/10970266/year/2008'},
 '2007 - Volume 28': {'href': '/loi/10970266/year/2007'},
 '2006 - Volume 27': {'href': '/loi/10970266/year/2006'},
 '2005 - Volume 26': {'href': '/loi/10970266/year/2005'},
 '2004 - Volume 25': {'href': '/loi/10970266/year/2004'},
 '2003 - Volum

In [7]:

# ⓶ Get the tables of contents for each issue of each volumne. We add this to
#    our smj_volumes dictionary so that our dictionary look like
#      {'volumne label' : {'href' : vol_url,
#                          'issues' : {'issue label' : 'issue ToC URL'}}}.
#    For compactness of logic, we construct a function `get_issue_tocs`
#    to get the actual tables of contents.  We then invoke this attribute 
#    via dictionary comprehension to get 
def get_issue_tocs(issue_url):
    # First we turn the issue URL into a soup object, using our sleep_get 
    # locally-defined function.
    issue_soup = BeautifulSoup(sleep_get(smj_url + issue_url,
                                          headers = headers).content,
                               'lxml')
    #
    # We next get to enumerated dictionaries.  The first contains all of the 
    # cover-date instances.  These are demarcated by a <span>-tagged object,
    # for which the `cover-date-value` class is assigned.  Secondly, we
    # retrieve all instances of <h4>-tagged objects with a `parent-item`
    # class label. These contain the issue information as well as the url to its ToC.
    cover_date = dict(enumerate(issue_soup.find_all('span', {'class' : 'cover-date-value'})))
    issue_href = dict(enumerate(issue_soup.find_all('h4', {'class' : 'parent-item'})))
    #
    # Now, merge the two dictionaries. In the process, we have to turn the issue_href
    # inside-out.  We ant the text item, the issue information, to be our dictionary keys.
    # The url to the issue ToC is an attribute of a <a>-tagged item. 
    return\
    {issue_attrs.text : {'href': issue_attrs.find('a').attrs.get('href'),
                         'cover_date' : cover_date.get(issue_idx).text}
                    for (issue_idx, issue_attrs) in issue_href.items()}



smj_volumes.update({jrnl_vol : {'href' : vol_contents.get('href'),
                                 'issues' : get_issue_tocs(vol_contents.get('href'))}
                        for (jrnl_vol, vol_contents) in smj_volumes.items()})

In [8]:
def get_article_abstract(abstract_url):
    # We get the article abstracts given the URL on which they are located.
    # We obtain these from the title_abstract_url object produced internal 
    # to the get_article_title_abst_url locally-defined function. 
    # Given this we get the html from the abstract page.  
    #
    # The abstract itself is embedded in a <div>-tagged object to which
    # the label
    #          `article-section__content en main`
    # is applied.  The text of this contains our abstract itself.
    # We apply some cleanup, removing new-line tags and 
    # summary headers.
    abstract_soup = \
    BeautifulSoup(sleep_get(smj_url + abstract_url,
                            headers = headers).content,
                  'lxml')
    abstract = abstract_soup.find('div', {'class' : 'article-section__content en main'})\
                            .text\
                            .replace('\n', ' ')\
                            .replace('research summary:', ' ')\
                            .replace('managerial summary:', ' ')\
                            .strip()
    return abstract

In [14]:
volume = np.random.choice(a = list(smj_volumes.keys()),
                          size = 1).item(0)
issue = np.random.choice(a = list(smj_volumes.get(volume).get('issues').keys()),
                         size = 1).item(0)
print(f'Volume : {volume},   Issue : {issue}')

smj_volumes.get(volume).get('issues').get(issue)

issue_attrs = smj_volumes.get(volume).get('issues').get(issue)
issue_url = issue_attrs.get('href')
print(smj_url + issue_url)

Volume : 2019 - Volume 40,   Issue : Volume 40, Issue 1
https://onlinelibrary.wiley.com/toc/10970266/2019/40/1


In [18]:
issue_ToC_soup =\
BeautifulSoup(sleep_get(smj_url + issue_url,
                          headers = headers).content,
             'lxml')\
                            .find_all('div', {'class' : 'card issue-items-container exportCitationWrapper'})
issue_ToC_secs_soup = dict(zip([toc_sec.find('h3').attrs.get('title')
                                        for toc_sec in issue_ToC_soup],
                                 issue_ToC_soup))
#
# Identify the ToC sections for which constituent articles have abstracts.
abstract_bearing_ToC_secs = ['articles', 'research notes and communication', 
                             'research articles', 'commentary',
                             'SPECIAL ISSUE ARTICLES'.lower(),
                             'introduction']
#
# Filter ToC sections, retaining only those for which abstracts are included.
issue_ToC_secs_soup = {section : issue_item_container
                         for (section, issue_item_container) in issue_ToC_secs_soup.items()
                         if section.lower() in abstract_bearing_ToC_secs}
#
# We now go after article titles and abstract URLs.  The essential information 
# is contained in <div> tabs assigned `issue-item` class tags.  We want to collect
# these from all of the ToC sections.  We collect the results in an enumerated
# dictionary.  Since our issue ToC may have more than one abstract-beraing section,
# our list-comprehension gives us a list of lists. We use itertools.chain to
# flatten into single list.
issue_item_soup = dict(enumerate(it.chain(*[issue_item.find_all('div', {'class' : 'issue-item'})
                                                for issue_item in issue_ToC_secs_soup.values()])))


<div class="issue-item">
<div class="pull-left">
<div class="doi-access-wrapper"><div class="free-access access-type"> <i aria-hidden="true" class="icon-icon-lock_open"></i> <div class="doi-access"> <div class="doi-access" tabindex="0">Free Access</div></div></div></div>
</div>
<div class="clearfix"></div>
<div class="clearfix"></div>
<div class="input-group exportCitationCheckbox js--hidden"><label class="checkbox--primary"><input data-type="exportCitation" name="exportCitation" title="Select article for bulk download or export: Multinational investment and the value of growth options: Alignment of incremental strategy to environmental uncertainty" type="checkbox" value="10.1002/smj.2969"/><span class="label-txt"></span></label></div><a class="issue-item__title visitable" href="/doi/10.1002/smj.2969">
<h2>Multinational investment and the value of growth options: Alignment of incremental strategy to environmental uncertainty</h2></a><ul class="rlist--inline loa comma loa-authors-trunc"

In [22]:
article = np.random.choice(a = list(issue_item_soup.keys()),
                           size = 1).item(0)
article

3

In [9]:
volume = np.random.choice(a = list(smj_volumes.keys()),
                          size = 1).item(0)
issue = np.random.choice(a = list(smj_volumes.get(volume).get('issues').keys()),
                         size = 1).item(0)
print(f'Volume : {volume},   Issue : {issue}')

smj_volumes.get(volume).get('issues').get(issue)

issue_attrs = smj_volumes.get(volume).get('issues').get(issue)
issue_url = issue_attrs.get('href')
print(issue_url)


def get_article_title_abst_url(issue_url):
    # We begin with issue_url, the url to the issue table-of-contents page.
    # The ToC is in a table-like format of hierarchical <div>-type objects
    # The delimiters of interest are the class labeled
    #      `card issue-items-container exportCitationWrapper`.
    # 
    # Now, multiple types of these `card issue-items container` occur. 
    # Not all of them contain abstracts.  We need to sort out the ones that
    # interest us and leave the rest. The ToC sections are distinguished
    # by <h3> tags, for which `title` is an attribute. We get these, and 
    # use them as keys, for which the corresponding values are the
    # totality of the issue-items containers.
    issue_ToC_soup =\
    BeautifulSoup(sleep_get(smj_url + issue_url,
                              headers = headers).content,
                 'lxml')\
                                .find_all('div', {'class' : 'card issue-items-container exportCitationWrapper'})
    issue_ToC_secs_soup = dict(zip([toc_sec.find('h3').attrs.get('title')
                                            for toc_sec in issue_ToC_soup],
                                     issue_ToC_soup))
    #
    # Identify the ToC sections for which constituent articles have abstracts.
    abstract_bearing_ToC_secs = ['articles', 'research notes and communication', 
                                 'research articles', 'commentary',
                                 'SPECIAL ISSUE ARTICLES'.lower(),
                                 'introduction']
    #
    # Filter ToC sections, retaining only those for which abstracts are included.
    issue_ToC_secs_soup = {section : issue_item_container
                             for (section, issue_item_container) in issue_ToC_secs_soup.items()
                             if section.lower() in abstract_bearing_ToC_secs}
    #
    # We now go after article titles and abstract URLs.  The essential information 
    # is contained in <div> tabs assigned `issue-item` class tags.  We want to collect
    # these from all of the ToC sections.  We collect the results in an enumerated
    # dictionary.  Since our issue ToC may have more than one abstract-beraing section,
    # our list-comprehension gives us a list of lists. We use itertools.chain to
    # flatten into single list.
    issue_item_soup = dict(enumerate(it.chain(*[issue_item.find_all('div', {'class' : 'issue-item'})
                                                    for issue_item in issue_ToC_secs_soup.values()])))
    #
    # Next, we get the title and the abstract URL. The title is contained in an <h2>-tagged object.
    # The URL to the abstract page is buried a bit deeper.  It resides in a <div>-tagged
    # object for which the class label
    #            `content-item-format-links`
    # is assigned. This contains a list, the first element of which is the abstract.  We get
    # the corresponding attribute from a <a>-labeled object.
    issue_items = {article_idx : {'title' : item_issue.find('h2').text,
                                   'abstract_url' : item_issue.find('div', {'class' : 'content-item-format-links'})\
                                                              .find('a')\
                                                              .attrs\
                                                              .get('href'),
                                   'link_type'    : item_issue.find('div', {'class' : 'content-item-format-links'})\
                                                              .find('a')\
                                                              .attrs\
                                                              .get('title').lower()}
                                    for (article_idx, item_issue) in issue_item_soup.items() } 
    #
    # We need to filter our dictionary.  Sometimes an article contains something other than
    # an abstract.  Here, we drop articles for which no abstract is provided.
    issue_items = {issue_idx : issue_attrs
                        for (issue_idx, issue_attrs) in issue_items.items()
                        if issue_attrs.get('link_type') == 'abstract'}
    #
    # We finally get the abstracts themselves. We use locally-defined function 
    # get_article_abstract to scrape this text block out of the location specified
    # by the corresponding URL
    issue_abstracts = {article_idx : {'abstract' : get_article_abstract(attribute.get('abstract_url'))}
                                        for (article_idx, attribute) in issue_items.items()}
    #
    # We now assemble all of the information by merging the dictionaries.
    issue_title_abstr = {issue_idx : {'title' : issue_attrs.get('title').lower(),
                                      'abstract_url' : issue_attrs.get('abstract_url'),
                                      'abstract' : issue_abstracts.get(issue_idx).get('abstract').lower()}
                         for (issue_idx, issue_attrs) in issue_items.items()}
    return issue_title_abstr




Volume : 1991 - Volume 12
Issue : Volume 12, Issue S1


In [10]:
smj_volumes_short = dict()
rand_select_vols = np.random.choice(a = list(smj_volumes.keys()),
                                    size = 2,
                                    replace = False)
print(f'Randomly-selected volumes {rand_select_vols}')
smj_volumes_short = {volume : vol_attrs
                        for (volume, vol_attrs) in smj_volumes.items()
                         if volume in rand_select_vols}
smj_volumes_short

Randomly-selected volumes ['1988 - Volume 9' '1996 - Volume 17']


{'1996 - Volume 17': {'href': '/loi/10970266/year/1996',
  'issues': {'Volume 17, Issue S2': {'href': '/toc/10970266/1996/17/S2',
    'cover_date': 'Winter 1996'},
   'Volume 17, Issue 9': {'href': '/toc/10970266/1996/17/9',
    'cover_date': 'November 1996'},
   'Volume 17, Issue 8': {'href': '/toc/10970266/1996/17/8',
    'cover_date': 'October 1996'},
   'Volume 17, Issue 7': {'href': '/toc/10970266/1996/17/7',
    'cover_date': 'July 1996'},
   'Volume 17, Issue 6': {'href': '/toc/10970266/1996/17/6',
    'cover_date': 'June 1996'},
   'Volume 17, Issue S1': {'href': '/toc/10970266/1996/17/S1',
    'cover_date': 'Summer 1996'},
   'Volume 17, Issue 5': {'href': '/toc/10970266/1996/17/5',
    'cover_date': 'May 1996'},
   'Volume 17, Issue 4': {'href': '/toc/10970266/1996/17/4',
    'cover_date': 'April 1996'},
   'Volume 17, Issue 3': {'href': '/toc/10970266/1996/17/3',
    'cover_date': 'March 1996'},
   'Volume 17, Issue 2': {'href': '/toc/10970266/1996/17/2',
    'cover_date': '

In [12]:
volume = np.random.choice(a = list(smj_volumes.keys()),
                                    size = 1,
                                    replace = False).item(0)

In [11]:
smj_vols_titles_abstracts = deepcopy(smj_volumes_short)
print('{} volumes started at {}'.format(str(len(smj_vols_titles_abstracts)), 
                                        datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
for (vols, vol_attrs) in smj_vols_titles_abstracts.items():
    issue_done = list()
    issue_missed = list()
    for (issue, issue_attrs) in vol_attrs.get('issues').items():
        try:
            issue_attrs.update({'issue_toc' : get_article_title_abst_url(issue_attrs.get('href'))})
            issue_done.append((issue, datetime.datetime.utcnow().strftime('%H%M%SZ')))
            print('{}, {} articles, success at {}'.format(issue, 
                                                          str(len(issue_attrs.get('issue_toc'))),
                                                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

        except:
            issue_missed.append((issue, datetime.datetime.utcnow().strftime('%H%M%SZ')))
            print('{} failure at {}'.format(issue, datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    vol_attrs.update({'issue_done' : issue_done,
                      'issue_missed' : issue_missed})
    with io.open('./data/smj_title_abstr.json', 'w', encoding = 'utf-8') as f:
        json.dump(smj_vols_titles_abstracts, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)

2 volumes started at 20-03-02, 172019Z
Volume 17, Issue S2, 11 articles, success at 20-03-02, 172058Z
Volume 17, Issue 9, 3 articles, success at 20-03-02, 172111Z
Volume 17, Issue 8, 4 articles, success at 20-03-02, 172146Z
Volume 17, Issue 7 failure at 20-03-02, 172149Z
Volume 17, Issue 6, 4 articles, success at 20-03-02, 172207Z
Volume 17, Issue S1, 10 articles, success at 20-03-02, 172243Z
Volume 17, Issue 5, 4 articles, success at 20-03-02, 172259Z
Volume 17, Issue 4, 4 articles, success at 20-03-02, 172312Z
Volume 17, Issue 3, 0 articles, success at 20-03-02, 172318Z
Volume 17, Issue 2, 0 articles, success at 20-03-02, 172320Z
Volume 17, Issue 1, 5 articles, success at 20-03-02, 172344Z


FileNotFoundError: [Errno 2] No such file or directory: './data/smj_title_abstr.json'

In [None]:
smj_vols_titles_abstracts

In [None]:
##################################################################
##🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑##
# smj_titles_abstracts = {vol : {'href' : vol_attrs.get('href'),
#                              'issues' : {issue : {'href' : iss_attr.get('href'),
#                                                   'cover_date' : iss_attr.get('cover_date'),
#                                                   'issue_toc' : get_article_title_abst_url(iss_attr.get('href'))}
#                                          for (issue, iss_attr) in vol_attrs.get('issues').items()}}
#                             for (vol, vol_attrs) in smj_vols_short.items()}
##🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑🛑##
##################################################################

In [None]:
issue_attrs

In [14]:



# volume = np.random.choice(a = list(smj_volumes_short.keys()),
#                           size = 1).item(0)
# volume_attrs = smj_volumes_short.get(volume)

# issue = np.random.choice(a = list(volume_attrs.get('issues').keys()),
#                           size = 1).item(0)

# issue_attrs = volume_attrs.get('issues').get(issue)



# print(f'Volume {volume}\nIssue {issue}')

