## Purpose.

Here we collect abstracts and titles from [*Journal of Business Strategies*](https://www.questia.com/library/p578/journal-of-business-strategies).  This information is to be used as part of an attempt to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without a paywall login. We collect them in ***four stages***.  

⓵ **Get the URL for each volume**. We start with the journal's home page.  The information we need is embedded in a frame, depicted below, on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual issues. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts are accessible from individual articles. We have to get the abstracts from these individual-article pages.






In [1]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
# driver = webdriver.Chrome()
# driver.get("https://www.nytimes.com")

In [2]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function.  This function
# waits for a random delay of between 1 and 2.5 seconds, then executes a
# req.get operation and returns the result.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 1,
                                 high = 2.5))
    return req.get(url,
                   headers = headers)
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    # Sort first so that each bin holds a contiguous run of values.
    parted_list = np.sort(np.array(parted_list))
    # np.array_split allows bins of unequal length, so the bin indices
    # never run past the end of the list (as they could with a fixed,
    # ceiling-based bin width, e.g. five items in four bins).
    partitions = np.array_split(parted_list, partition_counts)
    return dict(enumerate([part.tolist()
                             for part in partitions]))
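
The throttling idea behind `sleep_get` can be tried without touching the network. The following is a minimal sketch, not the notebook's own method: `throttled_call`, `fake_fetch`, and the `example.org` URLs are hypothetical names introduced for illustration, and the `sleep` parameter exists only so that the delay can be recorded instead of actually waited for.

```python
import random
import time


def throttled_call(fetch, url, low=1.0, high=2.5, sleep=time.sleep):
    """Pause for a random interval in [low, high] seconds, then call fetch(url)."""
    delay = random.uniform(low, high)
    sleep(delay)
    return fetch(url)


def fake_fetch(url):
    # Hypothetical stand-in for requests.get so the sketch runs offline.
    return f"fetched {url}"


# Record the delays rather than sleeping, to keep the demonstration fast.
recorded_delays = []
results = [throttled_call(fake_fetch, url, sleep=recorded_delays.append)
           for url in ["https://example.org/a", "https://example.org/b"]]
```

In real use one would pass `requests.get` (or a small wrapper that also supplies headers) as `fetch` and leave `sleep` at its default.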





In [None]:
# Journal home page: https://www.questia.com/library/p578/journal-of-business-strategies

In [3]:
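# Base URL without a trailing slash, so the path below joins cleanly.
jbs_url = 'https://www.questia.com'
driver = webdriver.Chrome()
driver.get(jbs_url + '/library/p578/journal-of-business-strategies')
# Keep only the links whose text starts with 'Vol'; these are the issues.
volIssues = {issue.text : {'href' : issue.get_attribute('href')}
            for issue in driver.find_elements(By.TAG_NAME, 'a')
            if issue.text[:3] == 'Vol'}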

In [4]:
issue = np.random.choice(a = list(volIssues.keys()),
                         size = 1).item(0)


In [5]:
for issue in volIssues.keys():
    print(f'Issue : {issue}')
    time.sleep(np.random.uniform(low = 1,
                                 high = 3))
    driver.get(volIssues.get(issue).get('href'))
    issueToc = dict(enumerate([{'href' : article.get_attribute('href'),
                                'title' : article.text}
                                 for article in driver.find_elements(By.CLASS_NAME, 'li-title')]))
    volIssues.get(issue).update({'issueToc' : issueToc})
    print('{}, {} articles, success at {}'.format(issue, 
                                              str(len(issueToc)),
                                              datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

Issue : Vol. 36, No. 2, Fall
Vol. 36, No. 2, Fall, 4 articles, success at 20-03-03, 233150Z
Issue : Vol. 36, No. 1, Spring
Vol. 36, No. 1, Spring, 4 articles, success at 20-03-03, 233153Z
Issue : Vol. 35, No. 2, Fall
Vol. 35, No. 2, Fall, 4 articles, success at 20-03-03, 233156Z
Issue : Vol. 35, No. 1, Spring
Vol. 35, No. 1, Spring, 4 articles, success at 20-03-03, 233200Z
Issue : Vol. 34, No. 2, Fall
Vol. 34, No. 2, Fall, 4 articles, success at 20-03-03, 233203Z
Issue : Vol. 34, No. 1, Spring
Vol. 34, No. 1, Spring, 3 articles, success at 20-03-03, 233206Z
Issue : Vol. 33, No. 2, Fall
Vol. 33, No. 2, Fall, 4 articles, success at 20-03-03, 233209Z
Issue : Vol. 33, No. 1, Spring
Vol. 33, No. 1, Spring, 4 articles, success at 20-03-03, 233212Z
Issue : Vol. 32, No. 2, Fall
Vol. 32, No. 2, Fall, 4 articles, success at 20-03-03, 233216Z
Issue : Vol. 32, No. 1, Spring
Vol. 32, No. 1, Spring, 4 articles, success at 20-03-03, 233220Z
Issue : Vol. 31, No. 2, Fall
Vol. 31, No. 2, Fall, 4 article
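
Because the goal is to chart how strategy topics progress over time, the issue labels printed above (for example `Vol. 36, No. 2, Fall`) eventually need to be turned into sortable values. The following is a minimal sketch under the assumption that every label follows the `Vol. N, No. M, Season` pattern seen in the output; `ISSUE_LABEL` and `parse_issue_label` are hypothetical names, not part of the notebook.

```python
import re

# Assumed label pattern, based on the labels printed by the loop above.
ISSUE_LABEL = re.compile(r"Vol\.\s*(\d+),\s*No\.\s*(\d+),\s*(\w+)")


def parse_issue_label(label):
    """Split an issue label into volume, number, and season; None if it does not match."""
    match = ISSUE_LABEL.match(label)
    if match is None:
        return None
    volume, number, season = match.groups()
    return {"volume": int(volume), "number": int(number), "season": season}


def issue_sort_key(label):
    """Order issues chronologically by (volume, number)."""
    parsed = parse_issue_label(label)
    return (parsed["volume"], parsed["number"])


labels = ["Vol. 35, No. 1, Spring", "Vol. 36, No. 2, Fall", "Vol. 34, No. 2, Fall"]
ordered = sorted(labels, key=issue_sort_key)
```

Labels that do not match the assumed pattern return `None` from `parse_issue_label`, so they should be filtered out before sorting.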

In [10]:
volIssues.get(issue).get('issueToc')

{0: {'href': 'https://www.questia.com/library/journal/1G1-613341333/a-model-relating-ceo-social-intelligence-and-strategic',
  'title': 'A Model Relating CEO Social Intelligence and Strategic Engagement: A Study in Small Business Organizations'},
 1: {'href': 'https://www.questia.com/library/journal/1G1-613341332/board-gender-diversity-social-performance-and-ceo',
  'title': 'Board Gender Diversity, Social Performance, and CEO Compensation'},
 2: {'href': 'https://www.questia.com/library/journal/1G1-613341334/musky-tweets-information-transfer-in-the-age-of-unbridled',
  'title': 'Musky Tweets: Information Transfer in the Age of Unbridled Social Media'},
 3: {'href': 'https://www.questia.com/library/journal/1G1-613341335/reconfiguring-dependencies-through-acquisitions-a',
  'title': 'Reconfiguring Dependencies through Acquisitions: A Resource Dependency Perspective'}}

In [40]:
driver = webdriver.Chrome()

for issue in volIssues.keys():
    issueToc = volIssues.get(issue).get('issueToc')
    for article in issueToc.keys():
        time.sleep(np.random.uniform(low = 1,
                                     high = 2.5))
        driver.get(volIssues.get(issue).get('issueToc').get(article).get('href'))
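        try:
            # The abstract teaser lives in the element with id 'teaserShowMore'.
            abstract = driver.find_element(By.ID, 'teaserShowMore').text
            issueToc.get(article).update({'abstract' : abstract.replace('\n', ' ')
                                                               .replace('*', '')
                                                               .lower(),
                                          'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
        except Exception:
            # No abstract found on this page; record an empty string.
            issueToc.get(article).update({'abstract' : str()})
        try:
            # Keywords are the lines that follow the 'SUBJECTS' heading
            # inside a 'detail-item' block.
            issueToc.get(article).update({'keyWords' : [detailItem.text
                                                              .lower()
                                                              .split('\n')[1:]
                                                      for detailItem in driver.find_elements(By.CLASS_NAME, 'detail-item')
                                                      if 'SUBJECTS' in detailItem.text][0]})
        except Exception:
            # No 'SUBJECTS' block found; record an empty keyword list.
            issueToc.get(article).update({'keyWords': list()})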
    print('{}, {} articles, success at {}'.format(issue, 
                                                  str(len(issueToc)),
                                                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))

    with io.open('../data/jrnlBizStrategies.json', 'w', encoding = 'utf-8') as f:
        json.dump(volIssues, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)



Vol. 36, No. 2, Fall, 4 articles, success at 20-03-04, 003645Z
Vol. 36, No. 1, Spring, 4 articles, success at 20-03-04, 003700Z
Vol. 35, No. 2, Fall, 4 articles, success at 20-03-04, 003712Z
Vol. 35, No. 1, Spring, 4 articles, success at 20-03-04, 003724Z
Vol. 34, No. 2, Fall, 4 articles, success at 20-03-04, 003738Z
Vol. 34, No. 1, Spring, 3 articles, success at 20-03-04, 003750Z
Vol. 33, No. 2, Fall, 4 articles, success at 20-03-04, 003803Z
Vol. 33, No. 1, Spring, 4 articles, success at 20-03-04, 003819Z
Vol. 32, No. 2, Fall, 4 articles, success at 20-03-04, 003832Z
Vol. 32, No. 1, Spring, 4 articles, success at 20-03-04, 003845Z
Vol. 31, No. 2, Fall, 4 articles, success at 20-03-04, 003900Z
Vol. 31, No. 1, Spring, 4 articles, success at 20-03-04, 003913Z
Vol. 30, No. 2, Fall, 4 articles, success at 20-03-04, 003924Z
Vol. 30, No. 1, Spring, 4 articles, success at 20-03-04, 003936Z
Vol. 29, No. 2, Fall, 4 articles, success at 20-03-04, 003951Z
Vol. 29, No. 1, Spring, 0 articles, succe
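
The text handling in the loop above can be separated from the browser calls and checked on plain strings. The following is a minimal sketch that mirrors the same string operations; `clean_abstract` and `extract_keywords` are hypothetical helper names, and the sample strings are made-up placeholders rather than text collected from the site.

```python
def clean_abstract(raw_text):
    """Join lines, drop asterisks, and lower-case, as the collection loop does."""
    return raw_text.replace("\n", " ").replace("*", "").lower()


def extract_keywords(detail_texts):
    """Return the lines after the 'SUBJECTS' heading of the first matching block."""
    for text in detail_texts:
        if "SUBJECTS" in text:
            # The first line is the heading itself; the rest are keywords.
            return text.lower().split("\n")[1:]
    return []


# Placeholder inputs standing in for element.text values from the page.
sample_abstract = clean_abstract("First line\n*Second line*")
sample_keywords = extract_keywords(["PUBLICATION\nSome Journal",
                                    "SUBJECTS\nStrategy\nLeadership"])
```

Returning an empty list when no `SUBJECTS` block is present plays the same role as the `except` branch in the loop, without relying on an exception.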

In [14]:
article

0
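
For the text-classification step, the nested `volIssues` dictionary is easier to work with as a flat list of article records. The following is a minimal sketch, assuming the structure built above (issue label, then `issueToc`, then per-article fields); `flatten_records` is a hypothetical helper, and the `href` and abstract values in the sample are placeholders. Note that JSON object keys are always strings, so the integer article positions come back as `"0"`, `"1"`, and so on after a save and reload.

```python
import json


def flatten_records(vol_issues):
    """Turn the nested issue/article dictionary into one row per article."""
    rows = []
    for issue_label, issue in vol_issues.items():
        for position, article in issue.get("issueToc", {}).items():
            rows.append({"issue": issue_label,
                         "position": position,
                         "title": article.get("title", ""),
                         "abstract": article.get("abstract", ""),
                         "keyWords": article.get("keyWords", [])})
    return rows


# Small sample in the same shape as volIssues; href and abstract are placeholders.
sample = {"Vol. 36, No. 2, Fall": {
    "href": "https://example.org/issue",
    "issueToc": {0: {"href": "https://example.org/article",
                     "title": "Board Gender Diversity, Social Performance, and CEO Compensation",
                     "abstract": "placeholder abstract text",
                     "keyWords": []}}}}

# A JSON round trip converts the integer key 0 into the string "0".
reloaded = json.loads(json.dumps(sample))
rows = flatten_records(reloaded)
```

A list of rows like this can be passed directly to `pd.DataFrame(rows)` if a tabular view is preferred.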