## Purpose.

Here we collect abstracts and titles from the [*Academy of Strategic Management Journal*](https://www.abacademies.org/journals/academy-of-strategic-management-journal-home.html).  The collected text feeds an attempt to chart the progression of business strategy through text classification.

## Approach.

Titles and abstracts are available without paywall login, but collecting them takes ***four stages***.  

⓵ **Get the URL for each volume**. We start with the journal's home page.  The information we need is embedded in a frame depicted below on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual issues. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts appear only on the individual-article pages, so we follow each article URL from the table of contents to retrieve them.






In [1]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime

In [2]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function.  This function
# pauses for a random interval between 0.5 and 2.5 seconds, then executes
# a req.get operation and returns the result.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 0.5,
                                 high = 2.5))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
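    # Clamp each bin's index range to the list length, so that trailing bins
    # are short or empty rather than indexing past the end of the list.
    list_len = parted_list.shape[0]
    partitions = [np.arange(start = min(part * partition_len, list_len),
                            stop = min((part + 1) * partition_len, list_len))
                     for part in range(partition_counts)]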
    return dict(enumerate([list(parted_list[part])
                             for part in partitions]))
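
For comparison, the same kind of partitioning can be sketched in pure Python so that bin sizes differ by at most one. The name `partition_evenly` is hypothetical and not part of this notebook, and the bin sizes it produces are not the same as those from `partition_list`.

```python
def partition_evenly(items, n_bins):
    """Split the sorted items into n_bins bins whose sizes differ by at most one."""
    items = sorted(items)
    # Every bin gets `base` items; the first `extra` bins get one more.
    base, extra = divmod(len(items), n_bins)
    bins = {}
    start = 0
    for index in range(n_bins):
        size = base + (1 if index < extra else 0)
        bins[index] = items[start:start + size]
        start += size
    return bins


# Example: ten volume years spread over three bins.
year_bins = partition_evenly(range(2011, 2021), 3)
```

Spreading the remainder over the leading bins avoids one very short final bin when the list length is not a multiple of the bin count.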





In [75]:
asmjUrl = 'https://www.abacademies.org/journals/'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
asmjHtml = sleep_get(asmjUrl + 'academy-of-strategic-management-journal-home.html',
                            headers = headers)
asmjSoup = BeautifulSoup(asmjHtml.content, 'lxml')
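
The request above relies on `sleep_get` for throttling. Because a long scrape will occasionally hit a transient failure, one possible extension is to pause and retry. The sketch below assumes a hypothetical helper, `throttled_call`, that wraps any fetch function; it is not part of this notebook and uses only the standard library.

```python
import random
import time


def throttled_call(fetch, url, low=0.5, high=2.5, retries=3):
    """Call fetch(url) after a random pause, retrying if it raises."""
    last_error = None
    for _ in range(retries):
        # Random pause before every attempt, as sleep_get does.
        time.sleep(random.uniform(low, high))
        try:
            return fetch(url)
        except Exception as error:
            last_error = error
    # All attempts failed: surface the most recent error.
    raise last_error
```

With the notebook's imports, a call might look like `throttled_call(lambda u: req.get(u, headers = headers, timeout = 30), url)`. Passing the fetch function in as an argument keeps the throttle independent of `requests`.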




In [76]:
# For this journal, the links to all of the issues appear on a single page.
# We simply have to collect them.

asmjTitleAbstract = {volume.find('h4').text : {'issueToc' :  {issue.text : {'href' : issue.attrs.get('href')}
                                                for issue in volume.find_all('a')}}
                        for volume in asmjSoup.find_all('div', {'class' : 'card-body'})}

In [77]:
volume = np.random.choice(a = list(asmjTitleAbstract.keys()),
                          size = 1).item(0)
issue = np.random.choice(a = list(asmjTitleAbstract.get(volume).get('issueToc').keys()),
                         size = 1).item(0)
print(f'Volume {volume}, issue {issue}')
asmjUrl + asmjTitleAbstract.get(volume).get('issueToc').get(issue).get('href')

Volume 2011, issue Volume 10, Issue 2


'https://www.abacademies.org/journals/month-december-year-2011-vol-10-issue-2-journal-asmj-past-issue.html'

In [78]:
asmjTitleAbstract.get(volume)

{'issueToc': {'Volume 10, Issue 2': {'href': 'month-december-year-2011-vol-10-issue-2-journal-asmj-past-issue.html'},
  'Volume 10, Issue 1': {'href': 'month-june-year-2011-vol-10-issue-1-journal-asmj-past-issue.html'}}}
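
A nested dictionary like the one above is convenient to build but awkward to inspect. As a minimal sketch, it can be flattened into one row per issue. The function name `flatten_issues` is hypothetical, and the sample data simply repeats the 2011 output shown above.

```python
def flatten_issues(title_abstract):
    """Return one (year, issue, href) tuple per issue in the nested dictionary."""
    return [(year, issue, meta.get('href'))
            for year, volume in title_abstract.items()
            for issue, meta in volume.get('issueToc', {}).items()]


# Sample copied from the 2011 entry printed above.
sample = {'2011': {'issueToc': {
    'Volume 10, Issue 2': {'href': 'month-december-year-2011-vol-10-issue-2-journal-asmj-past-issue.html'},
    'Volume 10, Issue 1': {'href': 'month-june-year-2011-vol-10-issue-1-journal-asmj-past-issue.html'}}}}
rows = flatten_issues(sample)
```

A list of tuples in this shape can be handed to `pd.DataFrame(rows, columns = ['year', 'issue', 'href'])` for tabular inspection.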

In [54]:
list(asmjTitleAbstract.keys())

['2020',
 '2019',
 '2018',
 '2017',
 '2016',
 '2015',
 '2014',
 '2013',
 '2012',
 '2011',
 '2010',
 '2009',
 '2008',
 '2007',
 '2006',
 '2005',
 '2004',
 '2003',
 '2002']

In [55]:
asmjTitleAbstract

{'2020': {'issueToc': {'Issue 1': {'href': 'month-february-year-2020-vol-19-issue-1-journal-asmj-past-issue.html'}}},
 '2019': {'issueToc': {'Issue 1': {'href': 'month-february-year-2019-vol-18-issue-1-journal-asmj-past-issue.html'}}},
 '2018': {'issueToc': {'Issue 1': {'href': 'month-february-year-2018-vol-17-issue-1-journal-asmj-past-issue.html'}}},
 '2017': {'issueToc': {'Special Issue 1': {'href': 'https://www.abacademies.org/special-issues/volume-16-special-issue-1.html'}}},
 '2016': {'issueToc': {'Issue 2': {'href': 'month-september-year-2016-vol-15-issue-2-journal-asmj-past-issue.html'}}},
 '2015': {'issueToc': {'Issue 1': {'href': 'month-june-year-2015-vol-14-issue-1-journal-asmj-past-issue.html'}}},
 '2014': {'issueToc': {'Issue 1': {'href': 'month-june-year-2014-vol-13-issue-1-journal-asmj-past-issue.html'}}},
 '2013': {'issueToc': {'Issue 1': {'href': 'month-june-year-2013-vol-12-issue-1-journal-asmj-past-issue.html'}}},
 '2012': {'issueToc': {'Issue 1': {'href': 'month-june
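
The listing above mixes relative hrefs with at least one absolute URL (the 2017 special issue). Plain concatenation of `asmjUrl` and an absolute href would yield a malformed address. A minimal sketch of a safer join, using the standard library's `urllib.parse.urljoin`, follows; the function name `issue_url` is hypothetical.

```python
from urllib.parse import urljoin

# Base URL as defined earlier in this notebook.
asmjUrl = 'https://www.abacademies.org/journals/'


def issue_url(href, base=asmjUrl):
    """Resolve an issue href against the base; absolute hrefs pass through unchanged."""
    return urljoin(base, href)


relative = issue_url('month-december-year-2011-vol-10-issue-2-journal-asmj-past-issue.html')
absolute = issue_url('https://www.abacademies.org/special-issues/volume-16-special-issue-1.html')
```
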

In [None]:
# Issues are organized by calendar year.  All of the calendar-year links are stored in
# `a` tags with class `title expander2 yearExpander`.  Assemble these into
# a dictionary.
stratOrgTitleAbstr = {calYear.text : {'href' : calYear.attrs.get('href')}
                        for calYear in stratOrgUrlSoup.find_all('a', {'class' : 'title expander2 yearExpander'})}


stratOrgTitleAbstr

In [None]:
# Collect urls for individual-issue ToCs.  Associated with each calendar year we have
# a URL to a page with each of the issues. We get an issue label and the URL for its 
# table of contents.

calYearDone = list()
calYearMissed = list()
for calYear in stratOrgTitleAbstr.keys():
    try:
        calYearHtml = sleep_get(stratOrgUrl + stratOrgTitleAbstr.get(calYear).get('href'),
                                headers = headers)
        stratOrgTitleAbstr.get(calYear).update(
            {'issueToC' : {issue.find('a').get_text(strip = True) : {'href' : issue.find('a').attrs.get('href')}
                                for issue in BeautifulSoup(calYearHtml.content, "lxml").find_all('h6')}})
        calYearDone.append(calYear)
        print('{} success at {}'.format(calYear, 
                                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    except Exception:
        calYearMissed.append(calYear)
        print('{} failure at {}'.format(calYear, 
                                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
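
The progress messages above use `datetime.datetime.utcnow()`, which is deprecated as of Python 3.12. A small sketch of a timezone-aware equivalent that produces the same timestamp format follows; the helper name `utc_stamp` is hypothetical.

```python
import datetime


def utc_stamp():
    """Return the current UTC time formatted like the log messages in this notebook."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return now.strftime('%y-%m-%d, %H%M%SZ')


message = '{} success at {}'.format('2011', utc_stamp())
```
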


In [None]:
stratOrgTitleAbstr

In [None]:
def joinDict(baseline, update):
    # Merge two dictionaries; on duplicate keys, values from `update` win.
    return {key : val
               for dictObjs in [baseline,
                                update]
               for (key, val) in dictObjs.items()}

def getIssueArticleTitleAbstr(calYear, issue):
    issueHtml = sleep_get(stratOrgTitleAbstr.get(calYear).get('issueToC').get(issue).get('href'),
                          headers = headers)
    issueSoup = BeautifulSoup(issueHtml.content,
                              'lxml').find('form', {'name' : 'frmAbs'})
    issueDict = stratOrgTitleAbstr.get(calYear).get('issueToC').get(issue)
    issueDict = dict(enumerate([{'title' : article.find('span', {'class' : 'hlFld-Title'})\
                                                  .text\
                                                  .lower(),
                                  'published' : article.find('span', {'class' : 'tocEPubDate'})\
                                                       .text\
                                                       .replace('First Published ', ''),
                                  'abstrURL' : article.find('div', {'class' : 'abstract-section'})\
                                                      .find('a')\
                                                      .attrs
                                                      .get('href')}
                                    for article in issueSoup.find_all('td', {'valign' : 'top'})
                                    if article.find('span', {'class' : 'hlFld-Title'}) is not None]))
    for article in issueDict.keys():
        articleDict = issueDict.get(article)
        try: 
            artAbstrHtml = sleep_get(stratOrgUrl + issueDict.get(article).get('abstrURL'),
                                     headers = headers)
            artAbstrSoup = BeautifulSoup(artAbstrHtml.content,
                                         'lxml').find('div', {'class' : 'abstractSection abstractInFull'})
            issueDict.get(article).update({'abstract' : artAbstrSoup.text\
                                                                    .lower()})
        except Exception:
            issueDict.get(article).update({'abstract' : str()})
    return issueDict


In [None]:
calYear = np.random.choice(a = list(stratOrgTitleAbstr.keys()),
                           size = 1).item(0)
issue = np.random.choice(a = list(stratOrgTitleAbstr.get(calYear).get('issueToC').keys()),
                         size = 1).item(0)
print('Calendar Year {}  issue {}\nURL {} '\
      .format(calYear, issue, stratOrgTitleAbstr.get(calYear).get('issueToC').get(issue).get('href')))

In [None]:
for calYear in stratOrgTitleAbstr.keys():
    issuesDone = list()
    issuesMissed = list()
    for issue in stratOrgTitleAbstr.get(calYear).get('issueToC').keys():
        try:
            stratOrgTitleAbstr.get(calYear)\
                             .get('issueToC')\
                             .get(issue)\
                             .update({'titleAbstr' : getIssueArticleTitleAbstr(calYear, issue)})
            issuesDone.append(', '.join([calYear, issue]))
            print('{}, {} success at {}'.format(calYear,
                                                    issue,
                                                    datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        except Exception:
            stratOrgTitleAbstr.get(calYear)\
                             .get('issueToC')\
                             .get(issue)\
                             .update({'titleAbstr' : dict()})
            issuesMissed.append(', '.join([calYear, issue]))
            print('{}, {} failure at {}'.format(calYear,
                                                    issue,
                                                    datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        with io.open('./data/stratOrgTitleAbstract.json', 'w', encoding = 'utf-8') as f:
            json.dump(stratOrgTitleAbstr, 
                      f, 
                      ensure_ascii = False, 
                      indent = 4)
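
The loop above rewrites the JSON checkpoint after every issue. If the process is interrupted in the middle of a write, the file can be left truncated. One common precaution, sketched here under the assumption that the checkpoint lives on a local filesystem, is to write to a temporary file and then rename it into place. The helper name `save_json_atomically` is hypothetical.

```python
import json
import os
import tempfile


def save_json_atomically(data, path):
    """Write data as JSON to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as handle:
            json.dump(data, handle, ensure_ascii=False, indent=4)
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the rename happens only after the write succeeds, the previous checkpoint stays intact until the new one is complete.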

            

In [None]:
stratOrgTitleAbstr.get(calYear)\
                 .get('issueToC')