## Purpose.

Here we collect titles and abstracts from [*California Management Review*](https://cmr.berkeley.edu/browse/issues/).  We intend to use this information to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without a paywall login, but we have to collect them in ***four stages***.  

⓵ **Get the URL for each volume**. We start with the journal's home page.  The information we need is embedded in a frame, depicted below, on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual issues. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts appear only on the individual-article pages, so we fetch each of those pages in turn.
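
The four stages fill in one nested dictionary, keyed first by publication year and then by issue label. The following is a minimal sketch of that shape. The year and issue values are copied from the notebook's own output; the article entry is a placeholder rather than scraped data, and `walk_articles` is a hypothetical helper that only illustrates how the levels nest.

```python
# Shape of the dictionary built by the four stages (a sketch).
# Year and issue values are copied from the notebook output; the article
# entry is a placeholder, not scraped data.
cmr_example = {
    '2018': {                                    # stage 1: volume (publication year)
        'href': '/loi/CMR?year=2018',
        'issues': {
            'Issue 2': {                         # stage 2: issue
                'href': 'https://journals.sagepub.com/toc/cmra/60/2',
                'coverDate': 'February 2018',
                'issueToc': {                    # stages 3 and 4: titles, abstracts
                    0: {'title': 'placeholder title',
                        'abstrURL': '/doi/placeholder',
                        'abstract': 'placeholder abstract',
                        'keyWords': []},
                },
            },
        },
    },
}


def walk_articles(data):
    """Yield (year, issue, article) triples from the nested dictionary."""
    for year, volume in data.items():
        for issue, toc in volume.get('issues', {}).items():
            for article in toc.get('issueToc', {}).values():
                yield year, issue, article


for year, issue, article in walk_articles(cmr_example):
    print(year, issue, article['title'])
```

Using `.get(..., {})` at each level lets the walk tolerate years or issues whose later stages have not run yet.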






In [1]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime

In [2]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function.  This function
# waits for a random delay of between 0.5 and 2.5 seconds, then executes
# a req.get operation and returns the result.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 0.5,
                                 high = 2.5))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
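    # Clip every bin at the end of the list so that no index runs past the
    # last element; trailing bins may be empty when there are few elements.
    partitions = [np.arange(start = min(part * partition_len, parted_list.shape[0]),
                            stop = min((part + 1) * partition_len, parted_list.shape[0]))
                     for part in range(partition_counts)]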
    return dict(enumerate([list(parted_list[part])
                             for part in partitions]))
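
The throttling idea behind `sleep_get` can be separated from the HTTP call itself. The following is a minimal sketch using only the standard library. `make_throttled` and its `sleep` and `rng` parameters are hypothetical names introduced here, so that the delay can be exercised without waiting or touching the network.

```python
import random
import time


def make_throttled(fetch, low=0.5, high=2.5, sleep=time.sleep, rng=random.uniform):
    """Wrap `fetch` so that each call first pauses for a random interval."""
    def throttled(*args, **kwargs):
        sleep(rng(low, high))          # random pause between `low` and `high` seconds
        return fetch(*args, **kwargs)
    return throttled


# Demonstration with stand-ins, so nothing touches the network or really sleeps.
pauses = []
fake_fetch = lambda url: f'fetched {url}'
polite_fetch = make_throttled(fake_fetch, sleep=pauses.append)

result = polite_fetch('https://journals.sagepub.com/toc/cmra/60/2')
print(result, pauses)
```

Passing the real `req.get` as `fetch` and leaving `sleep` at its default would give behavior similar to `sleep_get`, under the assumption that a uniform random delay is all the throttling needed.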





In [3]:
def getPubYearIssues(pubYear):
    pubYearUrl = cmrUrl + cmrTitleAbstract.get(pubYear).get('href')
    pubYearSoup = BeautifulSoup(sleep_get(pubYearUrl,
                                          headers = headers).content,
                                'lxml')
    return\
    {issue.find('a').get_text(strip = True).replace(',','') : 
                                              {'href' : issue.find('a').attrs.get('href'),
                                               'coverDate' : issue.find('span').get_text(strip = True)}
        for issue in pubYearSoup.find_all('div', {'class' : 'row js_issue'})}
def getIssueToc(pubYear, issue):
    issueURL = cmrTitleAbstract.get(pubYear).get('issues').get(issue).get('href')
    issueSoup = BeautifulSoup(sleep_get(issueURL,
                                        headers = headers).content,
                              'lxml').find('div', {'class' : 'tocContent'})\
                                     .find_all('td', {'valign' : 'top'})
    issueToc = dict(enumerate([{'title' : article.find('div', {'class' : 'art_title linkable'})
                                                    .find('h3', {'class' : 'heading-title'})
                                                    .text\
                                                    .lower(),
                                  'abstrURL' : article.find('a', {'data-item-name' : 'click-article-title'})\
                                                      .attrs\
                                                      .get('href')}
                                    for article in issueSoup
                                    if article.find('div', {'class' : 'art_title linkable'}) is not None]))
    for article in issueToc.keys():
        articleSoup = BeautifulSoup(sleep_get(cmrUrl + issueToc.get(article).get('abstrURL'),
                                              headers = headers).content,
                                    'lxml')
        # A missing element makes find() return None, so the chained attribute
        # access raises AttributeError; fall back to an empty value in that case.
        try: 
            issueToc.get(article).update({'abstract' : articleSoup.find('div', {'class' : 'abstractSection abstractInFull'})\
                                                                  .text\
                                                                  .lower()})
        except AttributeError:
            issueToc.get(article).update({'abstract' : str()})
        try:
            issueToc.get(article).update({'keyWords' : [keyWord.text
                                                        for keyWord in articleSoup.find('kwd-group')\
                                                                                  .find_all('a')]})
        except AttributeError:
            issueToc.get(article).update({'keyWords' : list()})
    issueToc = {issue : toc
                for (issue, toc) in issueToc.items()
                if len(toc.get('abstract')) > 0}
    return issueToc
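
The two functions above treat hrefs differently. `getPubYearIssues` and the abstract loop prepend `cmrUrl` to a relative href, while `getIssueToc` uses the issue href as-is because it is already absolute. One way to handle both cases uniformly is sketched below, assuming the standard library's `urllib.parse.urljoin` is acceptable here. The hrefs are copied from outputs shown elsewhere in this notebook, and `cmr_url` simply mirrors `cmrUrl`.

```python
from urllib.parse import urljoin

cmr_url = 'https://journals.sagepub.com'

# Hrefs copied from the notebook's outputs: two relative, one absolute.
hrefs = ['/loi/CMR?year=2019',
         'https://journals.sagepub.com/toc/cmra/62/1',
         '/doi/pdf/10.1525/cmr.2013.56.1.5']

# urljoin resolves a relative href against the base and leaves an
# absolute URL unchanged.
resolved = [urljoin(cmr_url, href) for href in hrefs]
for url in resolved:
    print(url)
```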

In [4]:
cmrUrl = 'https://journals.sagepub.com'
headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
cmrlHtml = sleep_get(cmrUrl + '/loi/CMR?year=2010-2019',
                     headers = headers)
cmrSoup = BeautifulSoup(cmrlHtml.content, 'lxml')




In [5]:
# This journal is organized according to publication year and issues therein.  Issues
# are released quarterly.  Here, we get the publication year and the issues.
cmrTitleAbstract = {pubYear.find('a').text : {'href' : pubYear.find('a').attrs.get('href')}
                                    for pubYear in cmrSoup.find_all('h4')
                                    if (pubYear.find('a') is not None)}
#
# Only publication years from 2012 onward are scrapable.  Filter the dictionary accordingly.
cmrTitleAbstract = {pubYear : pubs
                    for (pubYear, pubs) in cmrTitleAbstract.items()
                    if float(pubYear) >= 2012}

for pubYear in cmrTitleAbstract.keys():
    cmrTitleAbstract.get(pubYear).update({'issues' : getPubYearIssues(pubYear)})

In [6]:
cmrTitleAbstract

{'2019': {'href': '/loi/CMR?year=2019',
  'issues': {'Issue 1Current Issue': {'href': 'https://journals.sagepub.com/toc/cmra/62/1',
    'coverDate': 'Current Issue'},
   'Issue 4 Special Issue on AI': {'href': 'https://journals.sagepub.com/toc/cmra/61/4',
    'coverDate': 'August 2019'},
   'Issue 3': {'href': 'https://journals.sagepub.com/toc/cmra/61/3',
    'coverDate': 'May 2019'},
   'Issue 2': {'href': 'https://journals.sagepub.com/toc/cmra/61/2',
    'coverDate': 'February 2019'}}},
 '2018': {'href': '/loi/CMR?year=2018',
  'issues': {'Issue 1': {'href': 'https://journals.sagepub.com/toc/cmra/61/1',
    'coverDate': 'November 2018'},
   'Issue 4': {'href': 'https://journals.sagepub.com/toc/cmra/60/4',
    'coverDate': 'August 2018'},
   'Issue 3': {'href': 'https://journals.sagepub.com/toc/cmra/60/3',
    'coverDate': 'May 2018'},
   'Issue 2': {'href': 'https://journals.sagepub.com/toc/cmra/60/2',
    'coverDate': 'February 2018'}}},
 '2017': {'href': '/loi/CMR?year=2017',
  'is
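
The issue labels in this output are not uniform. `'Issue 1Current Issue'` is probably the result of `get_text(strip = True)` joining adjacent text fragments without a separator, and `'Issue 4 Special Issue on AI'` carries a theme. If a plain issue number is wanted downstream, it can be pulled from the front of each label. This is a minimal sketch; `issue_number` is a hypothetical helper and not part of the notebook.

```python
import re

# Labels copied from the output above.
labels = ['Issue 1Current Issue', 'Issue 4 Special Issue on AI', 'Issue 3']


def issue_number(label):
    """Return the integer issue number that leads a label, or None."""
    match = re.match(r'Issue\s+(\d+)', label)
    return int(match.group(1)) if match else None


print([issue_number(label) for label in labels])
```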

In [7]:
pubYear = np.random.choice(a = list(cmrTitleAbstract.keys()),
                           size = 1).item(0)
issue = np.random.choice(a = list(cmrTitleAbstract.get(pubYear).get("issues").keys()),
                         size = 1).item(0)
print(f'Publication Year {pubYear}, Issue {issue}')


issueDone = list()
issueMissed = list()
toc = cmrTitleAbstract.get(pubYear).get('issues').get(issue)
toc.update({'issueToc' : getIssueToc(pubYear, issue),
            'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
cmrTitleAbstract.get(pubYear).get('issues')
issueDone.append(issue)
print('Publication year {}, {}, {} articles, success at {}'\
      .format(pubYear,
              issue, 
              str(len(toc.get('issueToc'))),
              datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))



Publication Year 2013, Issue Issue 1
Publication year 2013, Issue 1, 6 articles, success at 20-03-01, 174659Z


In [8]:
pubYear = np.random.choice(a = list(cmrTitleAbstract.keys()),
                           size = 2).item(0)
print('{} Publication Years started at {}'.format(str(len(cmrTitleAbstract)), 
                                        datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
# for pubYear in np.random.choice(a = list(cmrTitleAbstract.keys()),
#                                 size = 2,
#                                 replace = False):
for pubYear in cmrTitleAbstract.keys():
    # Initialize lists within which to account for issues harvested, and missed.
    issueDone = list()
    issueMissed = list()
    #
    # Cycle through the issues in each publication year.
    for (issue, toc) in cmrTitleAbstract.get(pubYear).get('issues').items():
        try:
            toc.update({'issueToc' : getIssueToc(pubYear, issue),
                        'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
            issueDone.append(issue)
            print('Publication year {}, {}, {} articles, success at {}'\
                  .format(pubYear,
                          issue, 
                          str(len(toc.get('issueToc'))),
                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
        except Exception:
            toc.update({'issueToc' : dict(),
                        'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')})
            issueMissed.append(issue)
            print('Publication year {}, {},  failure at {}'\
                  .format(pubYear,
                          issue, 
                          datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')))
    cmrTitleAbstract.get(pubYear).update({'issueDone' : issueDone,
                                          'issueMissed' : issueMissed})
    with io.open('./data/cmrTitleAbstract.json', 'w', encoding = 'utf-8') as f:
        json.dump(cmrTitleAbstract, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)



            
        

    
    

8 Publication Years started at 20-03-01, 174659Z
Publication year 2019, Issue 1Current Issue, 7 articles, success at 20-03-01, 174721Z
Publication year 2019, Issue 4 Special Issue on AI, 8 articles, success at 20-03-01, 174749Z
Publication year 2019, Issue 3, 6 articles, success at 20-03-01, 174808Z
Publication year 2019, Issue 2, 7 articles, success at 20-03-01, 174831Z
Publication year 2018, Issue 1, 6 articles, success at 20-03-01, 174850Z
Publication year 2018, Issue 4, 5 articles, success at 20-03-01, 174910Z
Publication year 2018, Issue 3, 8 articles, success at 20-03-01, 174936Z
Publication year 2018, Issue 2, 8 articles, success at 20-03-01, 175005Z
Publication year 2017, Issue 1, 7 articles, success at 20-03-01, 175033Z
Publication year 2017, Issue 4, 7 articles, success at 20-03-01, 175056Z
Publication year 2017, Issue 3, 7 articles, success at 20-03-01, 175118Z
Publication year 2017, Issue 2, 7 articles, success at 20-03-01, 175147Z
Publication year 2016, Issue 1, 7 articles
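
Because the loop records failures in `issueMissed` rather than stopping, a follow-up pass can retry only those issues. A minimal sketch of collecting them follows. `missed_issues` is a hypothetical helper, and the example dictionary is invented for illustration, since the portion of the log shown above records no failures.

```python
def missed_issues(data):
    """List (year, issue) pairs recorded under each year's 'issueMissed'."""
    return [(year, issue)
            for year, volume in data.items()
            for issue in volume.get('issueMissed', [])]


# Hypothetical bookkeeping, shaped like the entries the loop writes.
example = {'2019': {'issueDone': ['Issue 3', 'Issue 2'], 'issueMissed': []},
           '2018': {'issueDone': ['Issue 4'], 'issueMissed': ['Issue 1']}}

print(missed_issues(example))
```

Each returned pair could then be passed back to `getIssueToc(pubYear, issue)` in a second, equally throttled pass.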

In [9]:
pubYear = np.random.choice(a = list(cmrTitleAbstract.keys()),
                           size = 1).item(0)
# pubYear = '2014'
issue = np.random.choice(a = list(cmrTitleAbstract.get(pubYear).get('issues').keys()),
                         size = 1).item(0)
print(f'Publication year {pubYear}  Issue {issue}')
print(cmrTitleAbstract.get(pubYear).get('issues').get(issue).get('href'))

getIssueToc(pubYear, issue)

Publication year 2013  Issue Issue 1
https://journals.sagepub.com/toc/cmra/56/1


{2: {'title': 'the new perspective on organizational wrongdoing',
  'abstrURL': '/doi/pdf/10.1525/cmr.2013.56.1.5',
  'abstract': 'wrongdoing in and by organizations offends public sensibilities, is costly to organizations, and is injurious to the individuals who perpetrate it and are victimized by it. the dominant perspective on organizational wrongdoing considers it to be an abnormal phenomenon; behavior that is rare, clearly aberrant, perpetrated by people who are abhorrent (who possess outsized preference structures or perverse values), and produced by a narrow range of out of whack organizational arrangements (most frequently, misaligned incentives or perverse cultures). however, new emerging theory and research on organizational wrongdoing present a fundamentally different perspective. it views wrongdoing as a normal phenomenon; behavior that is prevalent, not much different than rightdoing, perpetrated by people who are for the most part upstanding (otherwise ethical, socially r
