## Purpose.

Here we collect abstracts and titles from [*strategy+business*](https://www.strategy-business.com/) magazine, formerly published by the consultancy Booz Allen Hamilton.  This information is to be used in an attempt to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without a paywall login, but we have to collect them in ***four stages***.  

⓵ **Get the URL for each volume**. We start with the journal's home page.  Our essential information is embedded in a frame, depicted below, on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These thumbnails link to the individual issues. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts appear only on the individual-article pages, so we follow each article URL from the table of contents to collect them.


### Running on Azure.

If we need to run in Azure, we will need to install ChromeDriver on the Ubuntu host.  The guide at https://www.srcmake.com/home/selenium-python-chromedriver-ubuntu describes the procedure.
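
Once ChromeDriver is installed, it may be worth confirming that the binary is discoverable before launching Selenium. The following is a minimal sketch using only the standard library. The helper name `find_driver` and the default binary name `chromedriver` are assumptions for illustration and are not taken from the linked procedure.

```python
import shutil


def find_driver(candidates=('chromedriver',)):
    """Return the path of the first candidate executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


driver_path = find_driver()
if driver_path is None:
    print('chromedriver not found on PATH; install it before running the Selenium cells')
else:
    print('chromedriver found at {}'.format(driver_path))
```

Running this check first gives a clear message on a fresh host instead of a Selenium start-up error.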




In [2]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By


In [3]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function, which sleeps
# for a random interval between 0.75 and 2.75 seconds and then executes
# a req.get operation, returning the result.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 0.75,
                                 high = 2.75))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
    partitions = [np.array(object = range(partition_len)) + part * partition_len
                     for part in range(partition_counts)]
    partitions[-1] = np.arange(start = partitions[-1][0],
                               stop = parted_list.shape[0])
    return dict(enumerate([list(parted_list[part])
                             for part in partitions]))
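
`partition_list` above sizes every bin by the ceiling of the list length over the bin count and then trims only the last bin. When the ceiling-sized bins overrun the list before the last bin (five items in four bins, for example), the index arrays reach past the end of the list and NumPy raises an `IndexError`. The sketch below is an alternative, not the notebook's method: it uses plain Python and spreads the remainder across the leading bins, so bin sizes differ by at most one. The name `partition_list_even` is hypothetical.

```python
def partition_list_even(items, n_bins):
    """Sort items and split them into n_bins contiguous bins of near-equal size.

    Returns a dictionary {bin_index: list_of_items}. Bins may be empty
    when n_bins exceeds the number of items.
    """
    if n_bins < 1:
        raise ValueError('n_bins must be at least 1')
    ordered = sorted(items)
    base, extra = divmod(len(ordered), n_bins)
    bins = {}
    start = 0
    for idx in range(n_bins):
        # The first `extra` bins each take one additional item.
        size = base + (1 if idx < extra else 0)
        bins[idx] = ordered[start:start + size]
        start += size
    return bins


print(partition_list_even([5, 3, 1, 4, 2], 4))
```

Unlike the original, this version never leaves a single short bin at the end, so it is not a drop-in replacement where ceiling-sized bins are required.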


In [30]:
stratBiz = 'https://www.strategy-business.com/'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
# /loi/1932443x Strategic Entrepreneurship Journal
# /loi/10970266 Strategic Management Journal
stratBizHtml = sleep_get(stratBiz + 'past-issues',
                         headers = headers)
stratBizSoup = BeautifulSoup(stratBizHtml.content, 
                             'lxml').find('div', {'id' : 'Content'})\
                                    .find_all('h3', {'class' : 'title'})

In [29]:
# [issue.text\
#       .replace('\r\n', ' - ')\
#       .replace(' - ', '  - ')\
#       .split('  - ')
#     for issue in stratBizSoup]

In [33]:
{issue.text : {'href' : issue.find('a').attrs.get('href')}
    for issue in stratBizSoup}

{'Issue 98\r\nSpring 2020': {'href': 'https://www.strategy-business.com/current_issue'},
 'Issue 97\r\nWinter 2019': {'href': 'https://www.strategy-business.com/issue97-winter2019'},
 'Issue 96\r\nAutumn 2019': {'href': 'https://www.strategy-business.com/issue96-autumn2019'},
 'Issue 95\r\nSummer 2019': {'href': 'https://www.strategy-business.com/issue95-summer2019'},
 'Issue 94\r\nSpring 2019': {'href': 'https://www.strategy-business.com/issue94-spring2019'},
 'Issue 93\r\nWinter 2018': {'href': 'https://www.strategy-business.com/issue93-winter2018'},
 'Issue 92\r\nAutumn 2018': {'href': 'https://www.strategy-business.com/issue92-autumn2018'},
 'Issue 91 \r\nSummer 2018': {'href': 'https://www.strategy-business.com/issue91-summer2018'},
 'Issue 90 \r\nSpring 2018': {'href': 'https://www.strategy-business.com/issue90-spring2018'},
 'Issue 89 \r\nWinter 2017': {'href': 'https://www.strategy-business.com/issue89-winter2017'},
 'Issue 88 \r\nAutumn 2017': {'href': 'https://www.strategy-bu
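
The keys returned above embed a carriage return and line feed between the issue number and the cover date, and some carry a trailing space after the number. A minimal sketch of one way to normalize them follows, assuming every label has the form `Issue <number>`, whitespace, `<season> <year>` as in the output shown. The helper name `parse_issue_label` and the returned field names are hypothetical.

```python
import re

# Issue number, then any run of whitespace (including '\r\n'), then season and year.
_LABEL_PATTERN = re.compile(r'Issue\s+(\d+)\s+(\w+)\s+(\d{4})')


def parse_issue_label(label):
    """Split a raw issue label into its number, season, and year.

    Returns None when the label does not match the assumed pattern.
    """
    match = _LABEL_PATTERN.search(label)
    if match is None:
        return None
    return {'issue': int(match.group(1)),
            'season': match.group(2),
            'year': int(match.group(3))}


print(parse_issue_label('Issue 91 \r\nSummer 2018'))
```

Returning `None` for unmatched labels keeps unexpected page changes visible rather than silently producing partial records.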

In [4]:
# ⓵ Obtain the URLs for each volume. These are located on the journal's homepage.
#    Collect them in a dictionary labeled `smj_volumes`.  This is a simple 
#    nested dictionary of {'volume label ' : {'href' : vol_url}}.
#    The vol_url attribute is appended to our smj url to get the
#    URL for the issue table of contents.
#    NOTE: `smj_soup` and `smj_url` are not defined in the cells shown here;
#    they are assumed to have been set beforehand.
smj_volumes = {issues.find('a').attrs.get('title') : {'href' : issues.find('a').attrs.get('href')}
                                                     for issues in smj_soup.find('div', {'class' : "loi--aside__left"})\
                                                                           .find_all('li')
                                                     if issues.find('a').attrs.get('href') != '#' }

def getVolumeIssues(volUrl):
    volSoup = BeautifulSoup(sleep_get(volUrl,
                                      headers = headers).content,
                            'lxml').find('ul', {'class' : 'rlist loi__issues'})\
                                   .find_all('li', {'class' : 'card clearfix'})
# issue = volSoup[np.random.choice(a = range(len(volSoup)),
#                                  size = 1).item(0)]
    return {issue.find('a', {'class' : 'visitable'})
                 .text : {'href' : issue.find('a', {'class' : 'visitable'})\
                                       .attrs\
                                       .get('href'),
                         'coverDate' : issue.find('span', {'class' : 'cover-date-value'}).text }
             for issue in volSoup}
# ⓶ Get the issues associated with each volume.
for volume in smj_volumes.keys():
    smj_volumes.get(volume).update({'issues' : getVolumeIssues(smj_url + smj_volumes.get(volume).get('href'))})
    print('{} issues collected, success at {}'.format(volume, 
                                                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))




2019 - Volume 13 issues collected, success at 20-03-04, 131905Z
2018 - Volume 12 issues collected, success at 20-03-04, 131908Z
2017 - Volume 11 issues collected, success at 20-03-04, 131911Z
2016 - Volume 10 issues collected, success at 20-03-04, 131914Z
2015 - Volume 9 issues collected, success at 20-03-04, 131916Z
2014 - Volume 8 issues collected, success at 20-03-04, 131918Z
2013 - Volume 7 issues collected, success at 20-03-04, 131922Z
2012 - Volume 6 issues collected, success at 20-03-04, 131926Z
2011 - Volume 5 issues collected, success at 20-03-04, 131928Z
2010 - Volume 4 issues collected, success at 20-03-04, 131931Z
2009 - Volume 3 issues collected, success at 20-03-04, 131933Z
2008 - Volume 2 issues collected, success at 20-03-04, 131935Z
2007 - Volume 1 issues collected, success at 20-03-04, 131937Z
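
The loop above builds each volume URL by string concatenation, which presumes that the collected `href` values are relative. The *strategy+business* listing shown earlier returns absolute URLs, for which concatenation would repeat the host. A minimal sketch with the standard library's `urljoin`, which accepts either form, follows. The helper name `resolve` is hypothetical.

```python
from urllib.parse import urljoin

BASE = 'https://www.strategy-business.com/'


def resolve(base, href):
    """Return an absolute URL whether href is relative or already absolute."""
    return urljoin(base, href)


print(resolve(BASE, 'past-issues'))
print(resolve(BASE, 'https://www.strategy-business.com/issue97-winter2019'))
```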


In [None]:
# volume = np.random.choice(a = list(smj_volumes.keys()),
#                           size = 1).item(0)
# issue = np.random.choice(a = list(smj_volumes.get(volume).get('issues').keys()),
#                          size = 1).item(0)
# print(f'Volume : {volume},   Issue : {issue}')
# issue_url = smj_volumes.get(volume).get('issues').get(issue).get('href')
# print('Issue URL : {}'.format(smj_url + issue_url) )

# smj_volumes.get(volume).get('issues').get(issue).update({'issueToc' : get_issue_tocs(issue_url)})

In [5]:
# ⓷ Get the tables of contents for each issue of each volume. We add this to
#    our smj_volumes dictionary so that our dictionary looks like
#      {'volume label' : {'href' : vol_url,
#                         'issues' : {'issue label' : 'issue ToC URL'}}}.
#    For compactness of logic, we construct a function `get_issue_tocs`
#    to get the actual tables of contents.  We then invoke this function 
#    for each issue to collect its table of contents.
def get_issue_tocs(issue_url):
    issue_soup = BeautifulSoup(sleep_get(smj_url + issue_url,
                                          headers = headers).content,
                               'lxml').find_all('div', {'class' : 'issue-item'})
    issueToC = list()
    for issue in issue_soup:
        try:
            issueToC.append({'title' : issue.find('h2').text,
                             'pubDate' : issue.find('li', {'class' : 'ePubDate'})\
                                              .find_all('span')\
                                              [1]\
                                              .text,
                             'abstractHref' : issue.find('a', {'class' : 'issue-item__title visitable'})\
                                                   .attrs\
                                                   .get('href')})
        except (AttributeError, IndexError):
            # Skip entries that lack a title, publication date, or article link.
            pass
    return dict(zip(map(str, range(len(issueToC))),
                    issueToC))
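
The cells that follow repeatedly walk the same three levels of the nested dictionary: volume, issue, article. A minimal sketch of a generator that flattens this traversal follows, assuming the key names `issues` and `issueToC` used in this notebook. The helper name `iter_articles` and the sample data are hypothetical placeholders, not scraped values.

```python
def iter_articles(volumes):
    """Yield (volume_label, issue_label, article_key, article_dict) tuples.

    Volumes without 'issues' and issues without 'issueToC' are skipped.
    """
    for volume_label, volume in volumes.items():
        for issue_label, issue in volume.get('issues', {}).items():
            for article_key, article in issue.get('issueToC', {}).items():
                yield volume_label, issue_label, article_key, article


# Placeholder data shaped like the nested dictionary, for illustration only.
sample = {
    'Volume A': {
        'href': '/volume-a',
        'issues': {
            'Issue 1': {
                'href': '/volume-a/issue-1',
                'issueToC': {'0': {'title': 'Placeholder title',
                                   'abstractHref': '/article-0'}},
            },
        },
    },
}

for volume_label, issue_label, key, article in iter_articles(sample):
    print(volume_label, issue_label, key, article['title'])
```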

In [6]:
for volume in smj_volumes.keys():
    for issue in smj_volumes.get(volume).get('issues').keys():
        issueDict = smj_volumes.get(volume).get('issues').get(issue)
        issue_url = issueDict.get('href')
        try:
            issueToc = get_issue_tocs(issue_url)
            issueDict.update({'issueToC' : issueToc})
            print('Volume {}, {} success, {} articles at {}'\
                  .format(volume,
                          issue,
                          len(issueToc),
                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        except Exception:
            # Record an empty table of contents on failure.  We do not retry
            # here, because a second unguarded call would raise again.
            issueDict.update({'issueToC' : dict()})
            print('Volume {}, {} failure at {}'\
                  .format(volume,
                          issue,
                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    with io.open('../data/stratEntrepren.json', 'w', encoding = 'utf-8') as f:
        json.dump(smj_volumes, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)
    print('Saved {} to JSON at {}'\
          .format(volume,
                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

            

Volume 2019 - Volume 13, Volume 13, Issue 4 success, 6 articles at 20-03-04, 131957Z
Volume 2019 - Volume 13, Volume 13, Issue 3 success, 9 articles at 20-03-04, 132001Z
Volume 2019 - Volume 13, Volume 13, Issue 2 success, 7 articles at 20-03-04, 132005Z
Volume 2019 - Volume 13, Volume 13, Issue 1 success, 7 articles at 20-03-04, 132007Z
Saved 2019 - Volume 13 to JSON at 20-03-04, 132007Z
Volume 2018 - Volume 12, Volume 12, Issue 4 success, 7 articles at 20-03-04, 132010Z
Volume 2018 - Volume 12, Volume 12, Issue 3 success, 8 articles at 20-03-04, 132013Z
Volume 2018 - Volume 12, Volume 12, Issue 2 success, 7 articles at 20-03-04, 132015Z
Volume 2018 - Volume 12, Volume 12, Issue 1 success, 9 articles at 20-03-04, 132017Z
Saved 2018 - Volume 12 to JSON at 20-03-04, 132017Z
Volume 2017 - Volume 11, Volume 11, Issue 4 success, 7 articles at 20-03-04, 132021Z
Volume 2017 - Volume 11, Volume 11, Issue 3 success, 13 articles at 20-03-04, 132024Z
Volume 2017 - Volume 11, Volume 11, Issue 2 s
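
The loop above rewrites the whole JSON file after each volume, so an interruption during the write could leave a truncated file. A minimal sketch of one way to guard against that follows: write to a temporary file in the same directory, then swap it into place with `os.replace`. The helper names `save_checkpoint` and `load_checkpoint` are hypothetical and not part of the notebook.

```python
import json
import os
import tempfile


def save_checkpoint(data, path):
    """Write data as JSON to a temporary file, then move it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        # The old file stays intact until the new one is completely written.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved dictionary, or an empty one when no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)
```

A restarted run could call `load_checkpoint` first and skip volumes that already carry an `issueToC` entry, at the cost of trusting whatever the earlier run saved.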

In [7]:
smj_volumes.get(volume).get('issues').keys()

dict_keys(['Volume 1, Issue 3‐4', 'Volume 1, Issue 1‐2'])

In [8]:
volume = np.random.choice(a = list(smj_volumes.keys()),
                          size = 1).item(0)
issue = np.random.choice(a = list(smj_volumes.get(volume).get('issues').keys()),
                         size = 1).item(0)
issueTocDict = smj_volumes.get(volume).get('issues').get(issue).get('issueToC')
article = np.random.choice(a = list(issueTocDict.keys()),
                           size = 1).item(0)
print('Volume : {},   Issue : {},\nArticle  : {}\nAbstract URL : {}'\
       .format(volume, 
               issue, 
               issueTocDict.get(article).get('title'),
               smj_url + issueTocDict.get(article).get('abstractHref')) )




Volume : 2017 - Volume 11,   Issue : Volume 11, Issue 2,
Article  : Journal information ‐ JIF
Abstract URL : https://onlinelibrary.wiley.com/doi/10.1002/sej.1254


In [9]:
driver = webdriver.Chrome()

for volume in smj_volumes.keys():
    for issue in smj_volumes.get(volume).get('issues').keys():
        issueTocDict = smj_volumes.get(volume).get('issues').get(issue).get('issueToC')
        for article in issueTocDict.keys():
            time.sleep(np.random.uniform(low = 0.75,
                                         high = 2.75))
            driver.get(smj_url + issueTocDict.get(article).get('abstractHref'))
            # Selenium 4 removed the find_element_by_* helpers, so we use the
            # By locators imported above.
            try: 
                issueTocDict.get(article).update({'abstract' : 
                                                      driver.find_element(By.ID, 'section-1-en')\
                                                             .text\
                                                             .lower()\
                                                             .replace('\n','')\
                                                             .replace('abstract', '')\
                                                             .replace('research summary', ' ')\
                                                             .replace('managerial summary', ' ')})
            except Exception:
                # No abstract section found, or the lookup failed; store an empty string.
                issueTocDict.get(article).update({'abstract' : str()})
            try:
                issueTocDict.get(article).update({'keywords' : 
                                              [keyword.text
                                                    for keyword in driver.find_elements(By.CLASS_NAME, 'badge-type')]})
            except Exception:
                issueTocDict.get(article).update({'keywords' : list()})
            issueTocDict.get(article).update({'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
        print('Volume {}, {} success, {} articles at {}'\
              .format(volume,
                      issue,
                      len(issueTocDict),
                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

    with io.open('../data/stratEntrepren.json', 'w', encoding = 'utf-8') as f:
        json.dump(smj_volumes, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)
    print('Saved {} to JSON at {}'\
          .format(volume,
                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

# Close the browser once every volume has been processed.
driver.quit()



Volume 2019 - Volume 13, Volume 13, Issue 4 success, 6 articles at 20-03-04, 132338Z
Volume 2019 - Volume 13, Volume 13, Issue 3 success, 9 articles at 20-03-04, 132424Z
Volume 2019 - Volume 13, Volume 13, Issue 2 success, 7 articles at 20-03-04, 132456Z
Volume 2019 - Volume 13, Volume 13, Issue 1 success, 7 articles at 20-03-04, 132533Z
Saved 2019 - Volume 13 to JSON at 20-03-04, 132533Z
Volume 2018 - Volume 12, Volume 12, Issue 4 success, 7 articles at 20-03-04, 132606Z
Volume 2018 - Volume 12, Volume 12, Issue 3 success, 8 articles at 20-03-04, 132645Z
Volume 2018 - Volume 12, Volume 12, Issue 2 success, 7 articles at 20-03-04, 132711Z
Volume 2018 - Volume 12, Volume 12, Issue 1 success, 9 articles at 20-03-04, 132747Z
Saved 2018 - Volume 12 to JSON at 20-03-04, 132747Z
Volume 2017 - Volume 11, Volume 11, Issue 4 success, 7 articles at 20-03-04, 132815Z
Volume 2017 - Volume 11, Volume 11, Issue 3 success, 13 articles at 20-03-04, 132903Z
Volume 2017 - Volume 11, Volume 11, Issue 2 s
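
The abstract cleanup in the loop above removes the literal substring `abstract` wherever it occurs in the lowercased text, so a word such as "abstraction" would lose its first eight letters. It also joins lines without a space. A minimal sketch of an alternative follows. It matches the section headings only as whole words and collapses whitespace. The helper name `clean_abstract` is hypothetical.

```python
import re

# Section headings to drop, matched only at word boundaries.
_HEADINGS = re.compile(r'\b(?:abstract|research summary|managerial summary)\b')


def clean_abstract(raw):
    """Lowercase the text, drop section headings, and collapse whitespace."""
    text = raw.lower().replace('\n', ' ')
    text = _HEADINGS.sub(' ', text)
    return ' '.join(text.split())


print(clean_abstract('Abstract\nResearch Summary\nWe study abstraction in firms.'))
```

This still removes the standalone word "abstract" when it appears in the body of an abstract, so it is a narrower fix and not a complete one.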

In [None]:
smj_volumes.get(volume).get('issues')