## Purpose.

Here we collect abstracts and titles from [*Business Management and Strategy*](http://www.macrothink.org/journal/index.php/bms) magazine, formerly published by the consultancy Booz Allen Hamilton.  This information will support an attempt to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without a paywall login, but we have to collect them in ***four stages***.  

⓵ **Get the URL for each volume**. We start with the journal's home page.  Our essential information is embedded in a frame on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual issues. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts appear only on the individual-article pages, so we follow each article's URL to get them.
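
The stages above accumulate into one nested dictionary keyed by issue label. The following is a minimal sketch of that shape, not the notebook's own code: the helper names `new_issue_record` and `add_article`, the `example.org` URLs, and the text values are all hypothetical placeholders.

```python
# Hypothetical illustration of the nested structure the stages build up.
# Every value below is a placeholder, not real scraped data.
def new_issue_record(href):
    """Stages 1-2: an issue starts with only its table-of-contents URL."""
    return {'href': href, 'issueToC': {}}


def add_article(issue_record, title, href, abstract=''):
    """Stages 3-4: append one article (title, URL, abstract) to an issue."""
    index = len(issue_record['issueToC'])
    issue_record['issueToC'][index] = {'title': title.lower(),
                                       'href': href,
                                       'abstract': abstract.lower()}
    return index


example = {'Vol 1, No 1': new_issue_record('http://example.org/issue/view/1')}
add_article(example['Vol 1, No 1'],
            'A Placeholder Title',
            'http://example.org/article/view/10',
            'Placeholder abstract.')
```

Keeping titles, URLs, and abstracts under one record per issue means the whole structure can be written to a single JSON file at any point in the crawl.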


### Running on Azure.

If we need to run in Azure, we will need to install ChromeDriver on the Ubuntu host.  [This guide](https://www.srcmake.com/home/selenium-python-chromedriver-ubuntu) describes the procedure.
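
On a cloud host without a display, Chrome usually has to run headless. The sketch below shows one way to do that, under assumptions: the flag list is a commonly used set and is not taken from the linked guide, and `HEADLESS_ARGS` and `make_headless_driver` are hypothetical names.

```python
# Flags commonly passed to Chrome on a server with no display (an assumption,
# not something prescribed by this notebook or the linked guide).
HEADLESS_ARGS = ['--headless', '--no-sandbox', '--disable-dev-shm-usage']


def make_headless_driver(args=HEADLESS_ARGS):
    """Build a Chrome WebDriver configured with the given command-line flags."""
    # Imported here so that defining the helper does not itself require Selenium.
    from selenium import webdriver
    options = webdriver.ChromeOptions()
    for arg in args:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Calling `make_headless_driver()` would then stand in for the bare `webdriver.Chrome()` call, provided Chrome and a matching ChromeDriver are installed on the host.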




In [1]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By


In [2]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function.  This function
# waits a random 1.5 to 2.75 seconds, then executes a req.get operation
# and returns the result.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 1.5,
                                 high = 2.75))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
    # Slice consecutive bins of partition_len items.  Slicing clamps at the
    # end of the array, so the last bins may be short or empty instead of
    # indexing past the end.
    return dict(enumerate([list(parted_list[part * partition_len : (part + 1) * partition_len])
                             for part in range(partition_counts)]))
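
To make the binning described for `partition_list` concrete, here is a standard-library sketch of the same idea: sort, then cut into consecutive bins of `ceil(n / bins)` items. The name `chunk_list` is hypothetical, and this is an illustration of the technique rather than the notebook's function.

```python
import math


def chunk_list(items, bin_count):
    """Sort items and split them into bin_count consecutive bins.

    Each bin holds ceil(len(items) / bin_count) items, except that trailing
    bins may be shorter (or empty) when the items run out.
    """
    ordered = sorted(items)
    size = math.ceil(len(ordered) / bin_count)
    return {index: ordered[index * size:(index + 1) * size]
            for index in range(bin_count)}


bins = chunk_list(range(10), 3)
# bins == {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9]}
```

Bins like these could be used to split a long list of URLs across several runs or workers.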


In [3]:
bizMgtStrat = 'http://www.macrothink.org/journal/index.php/bms'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
# /loi/1932443x Strategic Entrepreneurship Journal
# /loi/10970266 Strategic Management Journal
bizMgtStratHtml = sleep_get(bizMgtStrat + '/issue/archive',
                            headers = headers)
bizMgtStratSoup = BeautifulSoup(bizMgtStratHtml.content, 
                               'lxml').find_all('h4')

In [4]:
# All published issues are indexed on a common page.  We get the volume and issue information, here.
# We look for publication date on the issue pages.
volIssue = {issue.text.split(' (')[0] : {'href' : issue.find('a').attrs.get('href')}
                    for issue in bizMgtStratSoup }
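
The dictionary key above is the heading text up to the first `' ('`. A small sketch of that split, extended to pull out the volume and issue numbers, follows. It assumes headings of the form `Vol 11, No 1 (2020)`; the parenthesised year and the helper name `parse_issue_heading` are assumptions, since only the `Vol 11, No 1` part is confirmed by the notebook's output.

```python
import re


def parse_issue_heading(heading_text):
    """Split an archive heading into its label, volume number, and issue number."""
    # Same split as in the dictionary comprehension: drop everything from ' (' on.
    label = heading_text.split(' (')[0].strip()
    match = re.match(r'Vol (\d+), No (\d+)', label)
    if match:
        volume, number = int(match.group(1)), int(match.group(2))
    else:
        # Unexpected heading format: keep the label, leave the numbers unset.
        volume, number = None, None
    return {'label': label, 'volume': volume, 'number': number}


parsed = parse_issue_heading('Vol 11, No 1 (2020)')
# parsed == {'label': 'Vol 11, No 1', 'volume': 11, 'number': 1}
```

Numeric volume and issue fields make it easy to sort issues chronologically later, which string labels alone do not guarantee.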

In [5]:


# Issue tables of contents are contained in a block demarcated by labels
# `<div id="content">`.  The simple layout allows us to get all of the individual-article
# information (except for abstracts) via dictionary comprehension.  Each article is
# demarcated by an html label `<div class="tocTitle">`.  The publication date appears in 
# a header block with a comma-separated field.
def getIssueToc(issueUrl):
    # Fetch the page named by the issueUrl argument rather than relying on
    # the global loop variable `issue`.
    issueSoup = BeautifulSoup(sleep_get(issueUrl,
                                        headers = headers).content,
                              'lxml').find('div', {'id' : 'content'})
    issueToc = dict(enumerate([{'title' : article.find('a')
                                                 .text
                                                 .lower(),
                      'href' : article.find('a')
                                      .attrs
                                      .get('href'),
                      'pubDate' : issueSoup.find('div', {'id' : 'issueDescription'})
                                           .text
                                           .split(', ')[-1]
                                           .split('\\')[0]}
                        for article in issueSoup.find_all('div', {'class' : 'tocTitle'})]))
    #
    # The abstract appears on each article's individual page.
    for article in issueToc.keys():
        articleSoup = BeautifulSoup(sleep_get(issueToc.get(article).get('href'),
                                              headers = headers).content,
                                    'lxml').find('div', {'id' : 'articleAbstract'})
        # Some article pages may lack an abstract block; store an empty string then.
        try:
            abstract = articleSoup.find('div').text.lower()
        except AttributeError:
            abstract = str()
        issueToc.get(article).update({'abstract' : abstract,
                                      'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
    #
    # Return the issue table of contents.
    return issueToc
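
The table-of-contents extraction can be tried offline on a small HTML fragment. The sketch below uses only the standard library's `html.parser` instead of BeautifulSoup, so it is a stand-in for the technique and not the notebook's method. The class name `TocTitleParser`, the sample markup, and the `example.org` URLs are hypothetical; only the `tocTitle` class and `content` id come from the notebook.

```python
from html.parser import HTMLParser


class TocTitleParser(HTMLParser):
    """Collect title/href pairs from links inside <div class="tocTitle">."""

    def __init__(self):
        super().__init__()
        self.in_toc_title = False   # True while inside a tocTitle div
        self.current_href = None    # href of the link being read, if any
        self.articles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and attrs.get('class') == 'tocTitle':
            self.in_toc_title = True
        elif tag == 'a' and self.in_toc_title:
            self.current_href = attrs.get('href')

    def handle_data(self, data):
        if self.current_href is not None and data.strip():
            self.articles.append({'title': data.strip().lower(),
                                  'href': self.current_href})

    def handle_endtag(self, tag):
        # Simplification: assumes no nested divs inside a tocTitle block.
        if tag == 'a':
            self.current_href = None
        elif tag == 'div':
            self.in_toc_title = False


sample = '''
<div id="content">
  <div class="tocTitle"><a href="http://example.org/article/view/1">First Placeholder Title</a></div>
  <div class="tocAuthors">A. Author</div>
  <div class="tocTitle"><a href="http://example.org/article/view/2">Second Placeholder Title</a></div>
</div>
'''
parser = TocTitleParser()
parser.feed(sample)
parser.close()
issue_toc = dict(enumerate(parser.articles))
```

Testing the parsing logic against a fixed fragment like this avoids hitting the journal's server while the selectors are still being worked out.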



In [6]:
for issue in volIssue.keys():
    issueUrl = volIssue.get(issue).get('href')
    issueToc = getIssueToc(issueUrl)
    volIssue.get(issue).update({'issueToC' : issueToc})
    print('Issue {} success, {} articles at {}'\
          .format(issue,
                  len(issueToc),
                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    with io.open('../data/bizMgtStrat.json', 'w', encoding = 'utf-8') as f:
        json.dump(volIssue, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)




Issue Vol 11, No 1 success, 5 articles at 20-03-07, 083350Z


NameError: name 'smj_volumes' is not defined
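
Because the loop rewrites the JSON file after every issue, an interrupted run leaves a usable checkpoint behind. The following sketch shows how a rerun might resume from it. The helper names `load_checkpoint`, `save_checkpoint`, and `pending_issues` are hypothetical, and the demo writes to a temporary directory rather than to `../data`.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved dictionary, or an empty one if no checkpoint exists."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def save_checkpoint(data, path):
    """Write to a temporary file first, then swap it in, so a crash
    mid-write cannot leave a truncated checkpoint."""
    tmp_path = path + '.tmp'
    with open(tmp_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    os.replace(tmp_path, path)


def pending_issues(vol_issue):
    """List the issue labels that have no table of contents yet."""
    return [label for label, record in vol_issue.items()
            if 'issueToC' not in record]


with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, 'checkpoint.json')
    vol_issue = {'Vol 1, No 1': {'href': 'http://example.org/issue/view/1',
                                 'issueToC': {}},
                 'Vol 1, No 2': {'href': 'http://example.org/issue/view/2'}}
    save_checkpoint(vol_issue, path)
    restored = load_checkpoint(path)
    todo = pending_issues(restored)
```

One caveat when reloading: JSON object keys are always strings, so integer article indices such as `0` come back as `'0'`.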

In [None]:
# issue = np.random.choice(a = list(volIssue.keys()),
#                          size = 1).item(0)
issueUrl = volIssue.get(issue).get('href')
print(f'Issue : {issue}\nURL : {issueUrl}')
issueSoup = BeautifulSoup(sleep_get(volIssue.get(issue).get('href'),
                                    headers = headers).content,
                          'lxml').find('div', {'id' : 'content'})
issueToc = dict(enumerate([{'title' : article.find('a')
                                             .text
                                             .lower(),
                  'href' : article.find('a')
                                  .attrs
                                  .get('href'),
                  'pubDate' : issueSoup.find('div', {'id' : 'issueDescription'})
                                       .text
                                       .split(', ')[-1]
                                       .split('\\')[0]}
                    for article in issueSoup.find_all('div', {'class' : 'tocTitle'})]))
# Pick a random article from this issue's table of contents.
article = np.random.choice(a = list(issueToc.keys()),
                           size = 1).item(0)
articleUrl = issueToc.get(article).get('href')
print('Article : {}\nURL : {}'\
      .format(issueToc.get(article).get('title'),
              issueToc.get(article).get('href')))
articleSoup = BeautifulSoup(sleep_get(articleUrl,
                                      headers = headers).content,
                            'lxml').find('div', {'id' : 'articleAbstract'})
issueToc.get(article).update({'abstract' :articleSoup.find('p').get_text(strip = True)\
                                                               .lower()})

In [None]:
list(articleSoup.children)

In [None]:
volIssue

In [None]:
# ⓵ Obtain the URLs for each volume. These are located on the journal's homepage.
#    We collect them in a dictionary labeled `smj_volumes`.  This is a simple 
#    nested dictionary of {'volume label' : {'href' : vol_url}}.
#    The vol_url attribute is appended to our smj_url to get the
#    URL for the issue table of contents.
#    Note: `smj_soup` and `smj_url` must already be defined; this notebook
#    does not define them.
smj_volumes = {issues.find('a').attrs.get('title') : {'href' : issues.find('a').attrs.get('href')}
                                                     for issues in smj_soup.find('div', {'class' : "loi--aside__left"})\
                                                                           .find_all('li')
                                                     if issues.find('a').attrs.get('href') != '#' }

def getVolumeIssues(volUrl):
    volSoup = BeautifulSoup(sleep_get(volUrl,
                                      headers = headers).content,
                            'lxml').find('ul', {'class' : 'rlist loi__issues'})\
                                   .find_all('li', {'class' : 'card clearfix'})
# issue = volSoup[np.random.choice(a = range(len(volSoup)),
#                                  size = 1).item(0)]
    return {issue.find('a', {'class' : 'visitable'})
                 .text : {'href' : issue.find('a', {'class' : 'visitable'})\
                                       .attrs\
                                       .get('href'),
                         'coverDate' : issue.find('span', {'class' : 'cover-date-value'}).text }
             for issue in volSoup}
# ⓶ Get the issues associated with each volume.
for volume in smj_volumes.keys():
    smj_volumes.get(volume).update({'issues' : getVolumeIssues(smj_url + smj_volumes.get(volume).get('href'))})
    print('{} issues collected, success at {}'.format(volume, 
                                                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))




In [None]:
# volume = np.random.choice(a = list(smj_volumes.keys()),
#                           size = 1).item(0)
# issue = np.random.choice(a = list(smj_volumes.get(volume).get('issues').keys()),
#                          size = 1).item(0)
# print(f'Volume : {volume},   Issue : {issue}')
# issue_url = smj_volumes.get(volume).get('issues').get(issue).get('href')
# print('Issue URL : {}'.format(smj_url + issue_url) )

# smj_volumes.get(volume).get('issues').get(issue).update({'issueToc' : get_issue_tocs(issue_url)})

In [None]:
# ⓷ Get the tables of contents for each issue of each volume. We add this to
#    our smj_volumes dictionary so that our dictionary looks like
#      {'volume label' : {'href' : vol_url,
#                         'issues' : {'issue label' : 'issue ToC URL'}}}.
#    For compactness of logic, we construct a function `get_issue_tocs`
#    to get the actual tables of contents.  We then invoke this function
#    for each issue to get its table of contents.
def get_issue_tocs(issue_url):
    issue_soup = BeautifulSoup(sleep_get(smj_url + issue_url,
                                          headers = headers).content,
                               'lxml').find_all('div', {'class' : 'issue-item'})
    issueToC = list()
    for issue in issue_soup:
        try:
            issueToC.append({'title' : issue.find('h2').text,
                             'pubDate' : issue.find('li', {'class' : 'ePubDate'})\
                                              .find_all('span')\
                                              [1]\
                                              .text,
                             'abstractHref' : issue.find('a', {'class' : 'issue-item__title visitable'})\
                                                   .attrs\
                                                   .get('href')})
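        except (AttributeError, IndexError):
            # Skip items that lack a title, publication date, or abstract link.
            continue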
    return dict(zip(map(str, range(len(issueToC))),
                    issueToC))

In [None]:
for volume in smj_volumes.keys():
    for issue in smj_volumes.get(volume).get('issues').keys():
        issueDict = smj_volumes.get(volume).get('issues').get(issue)
        issue_url = issueDict.get('href')
        try:
            issueToc = get_issue_tocs(issue_url)
            issueDict.update({'issueToC' : issueToc})
            print('Volume {}, {} success, {} articles at {}'\
                  .format(volume,
                          issue,
                          len(issueToc),
                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        except Exception:
            # Record an empty table of contents; do not retry the failed request here.
            issueDict.update({'issueToC' : dict()})
            print('Volume {}, {} failure at {}'\
                  .format(volume,
                          issue,
                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    with io.open('../data/stratEntrepren.json', 'w', encoding = 'utf-8') as f:
        json.dump(smj_volumes, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)
    print('Saved {} to JSON at {}'\
          .format(volume,
                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

            

In [None]:
smj_volumes.get(volume).get('issues').keys()

In [None]:
volume = np.random.choice(a = list(smj_volumes.keys()),
                          size = 1).item(0)
issue = np.random.choice(a = list(smj_volumes.get(volume).get('issues').keys()),
                         size = 1).item(0)
issueTocDict = smj_volumes.get(volume).get('issues').get(issue).get('issueToC')
article = np.random.choice(a = list(issueTocDict.keys()),
                           size = 1).item(0)
print('Volume : {},   Issue : {},\nArticle  : {}\nAbstract URL : {}'\
       .format(volume, 
               issue, 
               issueTocDict.get(article).get('title'),
               smj_url + issueTocDict.get(article).get('abstractHref')) )




In [None]:
driver = webdriver.Chrome()

for volume in smj_volumes.keys():
    for issue in smj_volumes.get(volume).get('issues').keys():
        issueTocDict = smj_volumes.get(volume).get('issues').get(issue).get('issueToC')
        for article in issueTocDict.keys():
            time.sleep(np.random.uniform(low = 0.75,
                                         high = 2.75))
            driver.get(smj_url + issueTocDict.get(article).get('abstractHref'))
            # Selenium 4 removed the find_element_by_* helpers; use By locators.
            try: 
                issueTocDict.get(article).update({'abstract' : 
                                                      driver.find_element(By.ID, 'section-1-en')\
                                                             .text\
                                                             .lower()\
                                                             .replace('\n','')\
                                                             .replace('abstract', '')\
                                                             .replace('research summary', ' ')\
                                                             .replace('managerial summary', ' ')})
            except Exception:
                issueTocDict.get(article).update({'abstract' : str()})
            try:
                issueTocDict.get(article).update({'keywords' : 
                                              [keyword.text
                                                    for keyword in driver.find_elements(By.CLASS_NAME, 'badge-type')]})
            except Exception:
                issueTocDict.get(article).update({'keywords' : list()})
            issueTocDict.get(article).update({'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
        print('Volume {}, {} success, {} articles at {}'\
              .format(volume,
                      issue,
                      len(issueTocDict),
                      datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

    with io.open('../data/stratEntrepren.json', 'w', encoding = 'utf-8') as f:
        json.dump(smj_volumes, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)
    print('Saved {} to JSON at {}'\
          .format(volume,
                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
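
The abstract clean-up chain in the loop above can be factored into a small function and tested without a browser. The sketch below is a variant of that chain, not a copy: it replaces newlines with spaces (so words on adjacent lines are not fused) and collapses runs of whitespace. The name `clean_abstract` is hypothetical.

```python
def clean_abstract(raw_text):
    """Lower-case an abstract block and strip its section labels."""
    text = raw_text.lower().replace('\n', ' ')
    # Remove the section labels.  As in the original chain, this also removes
    # these words wherever they occur inside the abstract body.
    for label in ('abstract', 'research summary', 'managerial summary'):
        text = text.replace(label, ' ')
    # Collapse the leftover runs of whitespace into single spaces.
    return ' '.join(text.split())


cleaned = clean_abstract('Abstract\nResearch Summary\nFirms differ.')
# cleaned == 'firms differ.'
```

Keeping the cleaning step separate from the scraping loop also makes it possible to store the raw text and re-clean it later if the rules change.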



In [None]:
smj_volumes.get(volume).get('issues')