## Purpose.

Here we collect titles and abstracts from [*Strategic Organization*](https://journals.sagepub.com/home/soq). This information feeds an attempt to use text classification to chart the progression of business strategy.

## Approach.

Titles and abstracts are available without a paywall login, but we have to collect them in ***four stages***.

⓵ **Get the URL for each volume**. We start with the journal's home page. The information we need is embedded in a frame on the right-hand side of the page.

⓶ **Get a list of issue URLs**. Each volume page contains thumbnail images of individual issues. These include URLs to the individual issues.

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts are accessible from individual articles. We have to get the abstracts from these individual-article pages.
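
The four stages above fill in one nested dictionary, keyed by calendar year, then issue, then article position. The following is a minimal sketch of that shape, assuming placeholder article values; `collected` and `flatten_records` are hypothetical names used only for illustration and are not part of the notebook.

```python
# Hypothetical miniature of the nested dictionary the stages build.
# The two href values mirror ones shown in this notebook; the article
# fields are placeholders, not real data.
collected = {
    '2019': {
        'href': '/loi/SOQ?year=2019',
        'issueToC': {
            'Issue 4,': {
                'href': 'https://journals.sagepub.com/toc/soqa/17/4',
                'titleAbstr': {
                    0: {'title': 'placeholder title',
                        'published': 'placeholder date',
                        'abstrURL': '/doi/abs/placeholder',
                        'abstract': 'placeholder abstract'},
                },
            },
        },
    },
}


def flatten_records(collected):
    """Turn the year -> issue -> article nesting into a flat list of rows."""
    rows = []
    for cal_year, year_entry in collected.items():
        for issue, issue_entry in year_entry.get('issueToC', {}).items():
            for position, article in issue_entry.get('titleAbstr', {}).items():
                rows.append({'calYear': cal_year,
                             'issue': issue,
                             'position': position,
                             **article})
    return rows


rows = flatten_records(collected)
```

A flat list of rows like this can be handed to `pd.DataFrame(rows)` when the text-classification step needs a table.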






In [1]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime

In [2]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function.  This function
# sleeps for a random interval between 0.5 and 2.5 seconds, then executes
# a req.get operation and returns the result.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 0.5,
                                 high = 2.5))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
    # Clip every bin at the end of the list, so that trailing bins come out
    # short or empty instead of indexing past the last element.
    partitions = [np.arange(start = part * partition_len,
                            stop = min((part + 1) * partition_len, len(parted_list)))
                     for part in range(partition_counts)]
    return dict(enumerate([list(parted_list[part])
                             for part in partitions]))
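
The throttling idea behind `sleep_get` can be checked without touching the network. This is a minimal sketch, assuming the fetch and sleep steps are passed in as arguments; `throttled_call` and its parameters are hypothetical and only mirror the 0.5 to 2.5 second delay range used above.

```python
import random
import time


def throttled_call(fetch, url, low=0.5, high=2.5, sleep=time.sleep):
    """Wait a random interval in [low, high] seconds, then call fetch(url)."""
    delay = random.uniform(low, high)
    sleep(delay)
    return fetch(url)


# Offline check: a stand-in fetch function and a "sleep" that only records
# the requested delay, so nothing is downloaded and nothing actually waits.
recorded_delays = []
result = throttled_call(fetch=lambda url: 'fetched ' + url,
                        url='https://journals.sagepub.com/loi/SOQ',
                        sleep=recorded_delays.append)
```

Passing the sleep function in makes the delay observable, which is a convenient way to confirm the throttle before starting a long collection run.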





In [3]:
stratOrgUrl = 'https://journals.sagepub.com'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
stratOrgUrlHtml = sleep_get(stratOrgUrl + '/loi/SOQ',
                            headers = headers)
stratOrgUrlSoup = BeautifulSoup(stratOrgUrlHtml.content, 'lxml')




In [4]:
# Issues are organized by calendar year.  All of the calendar-year links are stored in
# `a` tags with class `title expander2 yearExpander`.  Assemble these into
# a dictionary.
stratOrgTitleAbstr = {calYear.text : {'href' : calYear.attrs.get('href')}
                        for calYear in stratOrgUrlSoup.find_all('a', {'class' : 'title expander2 yearExpander'})}


stratOrgTitleAbstr

{'2020': {'href': '/loi/SOQ?year=2020-2020'},
 '2019': {'href': '/loi/SOQ?year=2019'},
 '2018': {'href': '/loi/SOQ?year=2018'},
 '2017': {'href': '/loi/SOQ?year=2017'},
 '2016': {'href': '/loi/SOQ?year=2016'},
 '2015': {'href': '/loi/SOQ?year=2015'},
 '2014': {'href': '/loi/SOQ?year=2014'},
 '2013': {'href': '/loi/SOQ?year=2013'},
 '2012': {'href': '/loi/SOQ?year=2012'},
 '2011': {'href': '/loi/SOQ?year=2011'},
 '2010': {'href': '/loi/SOQ?year=2010'},
 '2009': {'href': '/loi/SOQ?year=2009'},
 '2008': {'href': '/loi/SOQ?year=2008'},
 '2007': {'href': '/loi/SOQ?year=2007'},
 '2006': {'href': '/loi/SOQ?year=2006'},
 '2005': {'href': '/loi/SOQ?year=2005'},
 '2004': {'href': '/loi/SOQ?year=2004'},
 '2003': {'href': '/loi/SOQ?year=2003'}}

In [5]:
# Collect urls for individual-issue ToCs.  Associated with each calendar year we have
# a URL to a page with each of the issues. We get an issue label and the URL for its 
# table of contents.

calYearDone = list()
calYearMissed = list()
for calYear in stratOrgTitleAbstr.keys():
    try:
        calYearHtml = sleep_get(stratOrgUrl + stratOrgTitleAbstr.get(calYear).get('href'),
                                headers = headers)
        stratOrgTitleAbstr.get(calYear).update(
            {'issueToC' : {issue.find('a').get_text(strip = True) : {'href' : issue.find('a').attrs.get('href')}
                                for issue in BeautifulSoup(calYearHtml.content, "lxml").find_all('h6')}})
        calYearDone.append(calYear)
        print('{} success at {}'.format(calYear, 
                                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    except Exception:
        calYearMissed.append(calYear)
        print('{} failure at {}'.format(calYear, 
                                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))


2020 success at 20-02-29, 111521Z
2019 success at 20-02-29, 111523Z
2018 success at 20-02-29, 111526Z
2017 success at 20-02-29, 111530Z
2016 success at 20-02-29, 111532Z
2015 success at 20-02-29, 111534Z
2014 success at 20-02-29, 111536Z
2013 success at 20-02-29, 111538Z
2012 success at 20-02-29, 111542Z
2011 success at 20-02-29, 111544Z
2010 success at 20-02-29, 111547Z
2009 success at 20-02-29, 111550Z
2008 success at 20-02-29, 111553Z
2007 success at 20-02-29, 111554Z
2006 success at 20-02-29, 111556Z
2005 success at 20-02-29, 111558Z
2004 success at 20-02-29, 111602Z
2003 success at 20-02-29, 111605Z
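
The log lines above are stamped with `datetime.datetime.utcnow()`, which is deprecated as of Python 3.12. The following is a minimal sketch of a timezone-aware equivalent that produces the same stamp format; `utc_stamp` and `STAMP_FORMAT` are hypothetical names, not part of the notebook.

```python
import datetime

# Same format string the notebook uses for its success/failure messages.
STAMP_FORMAT = '%y-%m-%d, %H%M%SZ'


def utc_stamp(moment=None):
    """Format a timezone-aware UTC time; default to the current time."""
    if moment is None:
        moment = datetime.datetime.now(datetime.timezone.utc)
    return moment.strftime(STAMP_FORMAT)


# A fixed moment makes the output reproducible.
fixed = datetime.datetime(2020, 2, 29, 11, 15, 21,
                          tzinfo=datetime.timezone.utc)
message = '{} success at {}'.format('2020', utc_stamp(fixed))
print(message)  # 2020 success at 20-02-29, 111521Z
```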


In [6]:
stratOrgTitleAbstr

{'2020': {'href': '/loi/SOQ?year=2020-2020', 'issueToC': {}},
 '2019': {'href': '/loi/SOQ?year=2019',
  'issueToC': {'Issue 4,': {'href': 'https://journals.sagepub.com/toc/soqa/17/4'},
   'Issue 3,': {'href': 'https://journals.sagepub.com/toc/soqa/17/3'},
   'Issue 2,': {'href': 'https://journals.sagepub.com/toc/soqa/17/2'},
   'Issue 1,': {'href': 'https://journals.sagepub.com/toc/soqa/17/1'}}},
 '2018': {'href': '/loi/SOQ?year=2018',
  'issueToC': {'Issue 4,': {'href': 'https://journals.sagepub.com/toc/soqa/16/4'},
   'Issue 3,': {'href': 'https://journals.sagepub.com/toc/soqa/16/3'},
   'Issue 2,': {'href': 'https://journals.sagepub.com/toc/soqa/16/2'},
   'Issue 1, Special Issue: Strategic Management Theory and Universities,': {'href': 'https://journals.sagepub.com/toc/soqa/16/1'}}},
 '2017': {'href': '/loi/SOQ?year=2017',
  'issueToC': {'Issue 4,': {'href': 'https://journals.sagepub.com/toc/soqa/15/4'},
   'Issue 3,': {'href': 'https://journals.sagepub.com/toc/soqa/15/3'},
   'Iss

In [9]:
# Merge two dictionaries; values from update override those in baseline.
def joinDict(baseline, update):
    return {key : val
               for dictObjs in [baseline,
                                update]
               for (key, val) in dictObjs.items()}

def getIssueArticleTitleAbstr(calYear, issue):
    issueHtml = sleep_get(stratOrgTitleAbstr.get(calYear).get('issueToC').get(issue).get('href'),
                          headers = headers)
    issueSoup = BeautifulSoup(issueHtml.content,
                              'lxml').find('form', {'name' : 'frmAbs'})
    issueDict = stratOrgTitleAbstr.get(calYear).get('issueToC').get(issue)
    issueDict = dict(enumerate([{'title' : article.find('span', {'class' : 'hlFld-Title'})\
                                                  .text\
                                                  .lower(),
                                  'published' : article.find('span', {'class' : 'tocEPubDate'})\
                                                       .text\
                                                       .replace('First Published ', ''),
                                  'abstrURL' : article.find('div', {'class' : 'abstract-section'})\
                                                      .find('a')\
                                                      .attrs
                                                      .get('href')}
                                    for article in issueSoup.find_all('td', {'valign' : 'top'})
                                    if article.find('span', {'class' : 'hlFld-Title'}) is not None]))
    for article in issueDict.keys():
        articleDict = issueDict.get(article)
        try: 
            artAbstrHtml = sleep_get(stratOrgUrl + issueDict.get(article).get('abstrURL'),
                                     headers = headers)
            artAbstrSoup = BeautifulSoup(artAbstrHtml.content,
                                         'lxml').find('div', {'class' : 'abstractSection abstractInFull'})
            issueDict.get(article).update({'abstract' : artAbstrSoup.text\
                                                                    .lower()})
        except Exception:
            issueDict.get(article).update({'abstract' : str()})
    return issueDict
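
The hrefs collected so far come in two forms: the year pages and abstract links are site-relative (for example `/loi/SOQ?year=2019`), while the issue tables of contents are already absolute. The following is a minimal sketch of one way to handle both uniformly with the standard library's `urljoin`; `absolute_url` and `BASE_URL` are hypothetical names, not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'https://journals.sagepub.com'


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; absolute hrefs pass through unchanged."""
    return urljoin(base, href)


relative_result = absolute_url('/loi/SOQ?year=2019')
absolute_result = absolute_url('https://journals.sagepub.com/toc/soqa/17/4')
print(relative_result)  # https://journals.sagepub.com/loi/SOQ?year=2019
print(absolute_result)  # https://journals.sagepub.com/toc/soqa/17/4
```

With a helper like this, the same call works whether or not a given link already carries the host name, so no code path has to remember which kind it is holding.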


In [10]:
calYear = np.random.choice(a = list(stratOrgTitleAbstr.keys()),
                           size = 1).item(0)
issue = np.random.choice(a = list(stratOrgTitleAbstr.get(calYear).get('issueToC').keys()),
                         size = 1).item(0)
print('Calendar Year {}  issue {}\nURL {} '\
      .format(calYear, issue, stratOrgTitleAbstr.get(calYear).get('issueToC').get(issue).get('href')))

Calendar Year 2014  issue Issue 2,
URL https://journals.sagepub.com/toc/soqa/12/2 


In [21]:
for calYear in stratOrgTitleAbstr.keys():
    issuesDone = list()
    issuesMissed = list()
    for issue in stratOrgTitleAbstr.get(calYear).get('issueToC').keys():
        try:
            stratOrgTitleAbstr.get(calYear)\
                             .get('issueToC')\
                             .get(issue)\
                             .update({'titleAbstr' : getIssueArticleTitleAbstr(calYear, issue),
                                      'collectTime' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
            issuesDone.append(', '.join([calYear, issue]))
            print('{}, {} success at {}'.format(calYear,
                                                    issue,
                                                    datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        except Exception:
            stratOrgTitleAbstr.get(calYear)\
                             .get('issueToC')\
                             .get(issue)\
                             .update({'titleAbstr' : dict()})
            issuesMissed.append(', '.join([calYear, issue]))
            print('{}, {} failure at {}'.format(calYear,
                                                    issue,
                                                    datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        # Record progress before writing, so that the checkpoint file includes it.
        stratOrgTitleAbstr.get(calYear).update({'issuesDone' : issuesDone,
                                                'issuesMissed' : issuesMissed})
        with io.open('./data/stratOrgTitleAbstract.json', 'w', encoding = 'utf-8') as f:
            json.dump(stratOrgTitleAbstr, 
                      f, 
                      ensure_ascii = False, 
                      indent = 4)
            

2014, Issue 4, success at 20-02-29, 113601Z
2014, Issue 3, success at 20-02-29, 113621Z
2014, Issue 2, success at 20-02-29, 113639Z
2014, Issue 1, success at 20-02-29, 113700Z
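
One detail to keep in mind when reading the checkpoint file back: the articles are keyed by integer position, and JSON object keys are always strings, so `0` comes back as `'0'`. The following is a minimal sketch of the round trip, assuming a temporary file and a placeholder record; `int_keys` is a hypothetical helper, not part of the notebook.

```python
import io
import json
import os
import tempfile

# Placeholder record with an integer article key, as produced by enumerate.
record = {'titleAbstr': {0: {'title': 'placeholder title'}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'checkpoint.json')
    with io.open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False, indent=4)
    with io.open(path, 'r', encoding='utf-8') as f:
        restored = json.load(f)


def int_keys(mapping):
    """Convert string keys back to integers after loading from JSON."""
    return {int(key): val for key, val in mapping.items()}


restored_keys = list(restored['titleAbstr'].keys())
print(restored_keys)  # ['0']
```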


In [19]:
stratOrgTitleAbstr.get(calYear)\
                 .get('issueToC')

{'Issue 4,': {'href': 'https://journals.sagepub.com/toc/soqa/12/4',
  'titleAbstr': {0: {'title': 'the influences of capital market munificence on new-venture alliance formation in emerging industries',
    'published': 'July 27, 2014',
    'abstrURL': '/doi/abs/10.1177/1476127014543629',
    'abstract': 'how do public capital markets influence alliance formation by new ventures? resource dependency logic suggests that new ventures form fewer alliances when public capital markets are munificent because public investors demand less management control than do alliance partners. in contrast, according to the strategic behavior view, public capital market munificence can also motivate alliance formation because such munificence can enhance the returns that new-venture managers expect from using alliances. we seek to reconcile these two conflicting views of reinforcing versus attenuating effects of public capital market munificence on alliance formation. in this study of new internet ventur