## Purpose.

Here we collect abstracts and titles from [*International Journal of Business Strategy*](https://ijbs-journal.org/IJBS-JOURNAL/).  This information is to be used as part of an attempt to apply text classification to charting the progression of business strategy.

## Approach.

Titles and abstracts are available without a paywall login, but we have to collect them in ***four stages***.  

⓵ **Get the URL for each volume**. We start with the journal's home page.  Our essential information is embedded in a frame depicted below on the right-hand side of the page. 

⓶ **Get a list of issue URLs**.  Each volume page contains thumbnail images of individual issues. These include URLs to the individual issues. 

⓷ **Collect lists of titles**.  Follow each issue's URL to its issue table of contents. The tables of contents contain titles, as well as URLs to pages for individual articles.

⓸ **Collect abstracts**.  Abstracts are accessible from individual articles. We have to get the abstracts from these individual-article pages.
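
The end product of these stages is one record per article, grouped by issue. As a rough sketch of how such a nested result could be flattened into rows for a text classifier, the example below uses key names that follow the dictionaries built in the code cells of this notebook; the sample values and the `collected` and `rows` names are invented for illustration.

```python
# Invented sample; real entries also carry 'pubDate', 'collected' and 'keywords'.
collected = {
    "Volume 18, Issue 1, Published in June 2018": {
        "name": "ctl01",
        "id": "ctl01_id",
        "issueToc": {
            0: {"href": "https://example.org/a1",
                "title": "first title",
                "abstract": "first abstract"},
            1: {"href": "https://example.org/a2",
                "title": "second title",
                "abstract": "second abstract"},
        },
    },
}

# One flat row per article, tagged with its issue and its position in the ToC.
rows = [
    {"issue": issue, "position": position, **article}
    for issue, entry in collected.items()
    for position, article in entry.get("issueToc", {}).items()
]
```

A list of flat dictionaries like `rows` can be handed directly to `pd.DataFrame(rows)` if a tabular view is wanted.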






In [1]:
# Import libraries
import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time
import itertools as it
import json
import io
from copy import deepcopy
import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
# driver = webdriver.Chrome()
# driver.get("https://www.nytimes.com")

headers = {
    'User-Agent' : 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }



In [2]:
# We use lots of list comprehensions, which drive requests.get operations.
# We need to "throttle" these, so as to avoid the appearance of a DDoS
# attack.  We accomplish this with the sleep_get function.  This function
# executes a req.get operation, returning the result, after a random
# delay of two to four seconds.
def sleep_get(url, headers):
    time.sleep(np.random.uniform(low = 2,
                                 high = 4))
    return(req.get(url,
                   headers = headers))
#
# Partition a list into a specified number of bins.  Our inputs
# are:
# ⧐ parted_list is the list to be partitioned;
# ⧐ partition_counts specifies the number of bins into which
#   parted_list is divided.
# We produce an enumerated dictionary of the list partitions.
def partition_list(parted_list, partition_counts):
    parted_list = np.sort(np.array(parted_list))
    partition_len = int(np.ceil(len(parted_list)/partition_counts))
    # Slicing clamps at the end of the array, so the final bins are simply
    # shorter (or empty) when the list does not divide evenly.
    return dict(enumerate([list(parted_list[part * partition_len : (part + 1) * partition_len])
                             for part in range(partition_counts)]))
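
The throttling idea behind `sleep_get` can be tried out without touching the network. The sketch below is one possible generalisation, not the notebook's own code: `throttled`, `fake_get` and `pauses` are hypothetical names, and the injected `sleep` argument exists only so that the example runs instantly.

```python
import random
import time


def throttled(fetch, low=2.0, high=4.0, sleep=time.sleep):
    """Wrap `fetch` so that every call first pauses for a random interval."""
    def wrapper(*args, **kwargs):
        # Draw the pause uniformly from [low, high] seconds, as sleep_get does.
        sleep(random.uniform(low, high))
        return fetch(*args, **kwargs)
    return wrapper


# Stand-ins so the sketch runs without network access or real waiting:
# the "sleep" just records the requested pause, and the "fetch" echoes the URL.
pauses = []
fake_get = throttled(lambda url: f"fetched {url}", sleep=pauses.append)
results = [fake_get(url) for url in ("page-1", "page-2", "page-3")]
```

In real use, something like `throttled(req.get)` would presumably behave like `sleep_get`, with the default `time.sleep` supplying the actual delay.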





In [3]:
# We use a Selenium-driver session to get at the contents of the directory
# page.  We find heterogeneity in the structure of the html.  It turns out that
# content prior to about 2012 is in pdf format, therefore inaccessible using this
# approach.
# https://ijbs-journal.org/IJBS-JOURNAL/
ijbsUrl = 'https://ijbs-journal.org/'
ijbsSession = webdriver.Chrome()



In [4]:
# Our issue Tables of Contents are accessed through javascript click-through links.
# We want to get all of these.  We use `BeautifulSoup` to do this.  These are demarcated
# by html tags `<input type="submit", name="«name»", id="«id»">`.  There are many of these
# for various purposes.  We identify those that interest us in that they have values 
# with the volume, issue, publication date.
issueVolSoup = {inputTag.attrs.get('value') : 
                         {'name' : inputTag.attrs.get('name'),
                          'id' : inputTag.attrs.get('id')}
                                    for inputTag in BeautifulSoup(sleep_get(ijbsUrl,
                                                                                headers = headers).content,
                                                                      'lxml').find_all('input', {'type' : 'submit'})
                                    if 'Volume' in inputTag.attrs.get('value', '')}
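
To see what this filter keeps and what it discards, here is a small offline sketch. The markup in `sample_html` is invented for illustration and is much simpler than the real page; the built-in `html.parser` is used instead of `lxml` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; the real page has many more inputs.
sample_html = """
<form>
  <input type="submit" name="btnSearch" id="btnSearch" value="Search">
  <input type="submit" name="ctl01" id="ctl01_id"
         value="Volume 18, Issue 1, Published in June 2018">
  <input type="submit" name="btnBare" id="btnBare">
  <input type="text" name="query" value="Volume">
</form>
"""

soup = BeautifulSoup(sample_html, "html.parser")
issues = {
    tag["value"]: {"name": tag.get("name"), "id": tag.get("id")}
    for tag in soup.find_all("input", {"type": "submit"})
    # A default of '' keeps inputs without a value attribute from raising.
    if "Volume" in tag.get("value", "")
}
```

Only the submit input whose value mentions a volume survives: the search button, the valueless button and the text input are all dropped.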



In [5]:
issue = np.random.choice(a = list(issueVolSoup.keys()),
                         size = 1).item(0)
print(f'Issue : {issue}')
issueTagName = issueVolSoup.get(issue).get('name')



# We need to click through to each issue Table of Contents.  This works only if we 
# first return to the volume/issue directory.  This directory is accessed via a menu item
# at the top of the home page demarcated by html tag
#            `<button type="button" id="changeTabTodiv_TAB_ArchiveFromTopMenu">`.  
# Once we've navigated back to this location we can click forward to an issue
# table-of-contents page.
def navigateToIssueToc(issueTagName):
    # ⓵ Return to the journal's homepage.
    ijbsSession.get(ijbsUrl)
    time.sleep(np.random.uniform(low = 2,
                                 high = 4))
    #
    # ⓶ "Click" the tab pointing to the archive section.
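    # `find_element(By.…)` replaces the `find_element_by_*` helpers,
    # which were removed in Selenium 4.
    ijbsSession.find_element(By.ID, 'changeTabTodiv_TAB_ArchiveFromTopMenu').click()
    #
    # ⓷ Navigate to the issue Table-of-contents page, using the tag name
    #   passed in rather than the global `issue`.
    time.sleep(np.random.uniform(low = 2,
                                 high = 4))
    ijbsSession.find_element(By.NAME, issueTagName).click()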
#
# The article information is available from a clean html-format
# website.  We get all of our information from this site.
def getArticleTitleAbstract(articleUrl):
    articleSoup = BeautifulSoup(sleep_get(articleUrl,
                                          headers = headers).content,
                                'lxml')
    issueToc = {'title' : articleSoup.find('span', {'id' : 'lbl_Title'})\
                                      .text\
                                      .lower(),
                 'abstract' : articleSoup.find('span', {'id' : 'lbl_ABSTRACT'})\
                                         .text\
                                         .lower(),
                  'pubDate' : articleSoup.find('span', {'id' : 'lbl_IssuePageDate'})\
                                         .text.split(', ')\
                                         [-1],
                  'collected' : datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')}
    try:
        issueToc.update({'keywords' : [keyword.lower()
                            for keyword in articleSoup.find('span', {'id' : 'lbl_Keywords'}).text\
                                                                              .split('; ')]})
    except AttributeError:
        issueToc.update({'keywords' : list()})
    return issueToc



Issue : Volume 18, Issue 1, Published in June 2018
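
The fallback for articles without keywords can also be written without `try`/`except`, by checking whether the `<span>` exists before reading its text. The sketch below is one such variant, run on invented markup: only the span ids follow `getArticleTitleAbstract`, while `span_text`, `sample_article` and the sample text (including the date format) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Invented article markup; only the span ids follow the notebook.
sample_article = """
<span id="lbl_Title">Strategy Under Uncertainty</span>
<span id="lbl_ABSTRACT">An Illustrative Abstract.</span>
<span id="lbl_IssuePageDate">Volume 18, Issue 1, June 2018</span>
"""


def span_text(soup, span_id, default=""):
    """Return the text of <span id=span_id>, or `default` if it is absent."""
    span = soup.find("span", {"id": span_id})
    return span.text if span is not None else default


soup = BeautifulSoup(sample_article, "html.parser")
record = {
    "title": span_text(soup, "lbl_Title").lower(),
    "abstract": span_text(soup, "lbl_ABSTRACT").lower(),
    "pubDate": span_text(soup, "lbl_IssuePageDate").split(", ")[-1],
    # No lbl_Keywords span in the sample, so this yields an empty list.
    "keywords": [kw.lower()
                 for kw in span_text(soup, "lbl_Keywords").split("; ") if kw],
}
```

With this helper a missing title or abstract would produce an empty string rather than an exception, which may or may not be the behaviour wanted for a given run.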


This particular website is kind of kludgy, to state things in diplomatic terms.  We need `selenium` to drive the JavaScript-rendered pages. `BeautifulSoup` helps us with the html, once we've stripped it out.  

In this case, our issue information is tabulated in tables on issue table-of-contents pages. We want to use `selenium` to get the table itself.  This excises the table out of lots of clutter.  We then use `BeautifulSoup` for the table itself, because it gives better transparency.

Here is the documentation for the [Selenium Client Driver](https://www.selenium.dev/selenium/docs/api/py/index.html).  The [`selenium.webdriver.remote.webelement`](https://www.selenium.dev/selenium/docs/api/py/webdriver_remote/selenium.webdriver.remote.webelement.html) page describes the web-element attributes.  A somewhat old StackOverflow posting, [Get HTML Source of WebElement in Selenium WebDriver using Python](https://stackoverflow.com/questions/7263824/get-html-source-of-webelement-in-selenium-webdriver-using-python), points us to the `webelement.get_attribute('innerHTML')` method, which gives us html content parsable using `BeautifulSoup`.

So, the innermost object encapsulating the table of contents is an html object tagged `<div id="GridViewArchiveIssueArticles">`.  We use `selenium` to get the corresponding `webelement`.  Its `.get_attribute('innerHTML')` method gives a `BeautifulSoup`-parsable html block.

Moving outside-in, we get a `<tbody>`-tagged object and then use `BeautifulSoup().find_all('tr')` to get the individual article entries. The unambiguous html elements demarcating individual articles in the table of contents are `<td valign="top">`.  We use the `BeautifulSoup().find_all()` method to get these items.
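
The `BeautifulSoup` half of this can be sketched offline. In the example below, `inner_html` is an invented stand-in for the string that `get_attribute('innerHTML')` would return; the real table is larger and its exact layout is not reproduced here. The built-in `html.parser` is used only to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the string returned by get_attribute('innerHTML').
inner_html = """
<table>
  <tbody>
    <tr><th>Title</th></tr>
    <tr><td valign="top"><a href="https://example.org/article?id=1">First title</a></td></tr>
    <tr><td valign="top"><a href="https://example.org/article?id=2">Second title</a></td></tr>
    <tr><td>Footer cell without a link</td></tr>
  </tbody>
</table>
"""

# Keep only the cells that mark individual articles, then pull each link.
cells = (BeautifulSoup(inner_html, "html.parser")
         .find("tbody")
         .find_all("td", {"valign": "top"}))
issue_toc = dict(enumerate({"href": cell.find("a").get("href")}
                           for cell in cells))
```

Filtering on `valign="top"` is what skips the header and footer cells here, so only cells that actually contain an article link reach the `find('a')` call.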



In [6]:
# We now cycle through the issues and add the table-of-contents information, including
# titles and abstracts. For each issue we:
issue = np.random.choice(a = list(issueVolSoup.keys()),
                         size = 1).item(0)
print(f'Issue : {issue}')
issueArticleTaken = list()
issueArticleMissed = list()
for issue in issueVolSoup.keys():
    # ⓐ Navigate to its ToC page.
    navigateToIssueToc(issueVolSoup.get(issue).get('name'))
    #
    # ⓑ Get the html that makes up the tabulated issue table of contents.
    #    We use the `selenium` `webelement.get_attribute('innerHTML')` method to get
    #    the html, which we then parse using BeautifulSoup.  We construct an
    #    enumerated dictionary `issueToc` containing what we need.  All we get from this
    #    💩💩💩 page is the URL for the articles themselves.
    issueTocHtml = BeautifulSoup(ijbsSession.find_element(By.ID, 'GridViewArchiveIssueArticles')\
                                            .get_attribute('innerHTML'),
                                 'lxml').find('tbody')\
                                        .find_all('td', {'valign' : 'top'})
    issueToc = dict(enumerate([{'href' : article.find('a')\
                                                .attrs\
                                                .get('href')}
                                    for article in issueTocHtml ]))
    #
    # ⓒ We then cycle through the articles in `issueToc`.  We employ the previously-defined
    #   `getArticleTitleAbstract` to fill out the remaining information in the enumerated
    #   dictionary `issueToc`. 
    for article in issueToc.keys():
        issueToc.get(article).update(getArticleTitleAbstract(issueToc.get(article).get('href')))
    #
    # ⓓ Finally, we update our original `issueVolSoup` dictionary with the
    #    `issueToc` dictionary and checkpoint the results to disk.
    issueVolSoup.get(issue).update({'issueToc' : issueToc})
    print('Issue {}, {} articles, success at {}'\
          .format(issue, 
                  len(issueToc), 
                  datetime.datetime.utcnow().strftime('%y-%m-%d, %H%M%SZ')) )
    with io.open('../data/intJrnlBizStrat.json', 'w', encoding = 'utf-8') as f:
        json.dump(issueVolSoup, 
                  f, 
                  ensure_ascii = False, 
                  indent = 4)



Issue : Volume 17, Issue 2, Published in October 2017
Issue Volume 19, Issue 1, Published in March 2019, 10 articles, success at 20-03-05, 012507Z
Issue Volume 18, Issue 2, Published in October 2018, 11 articles, success at 20-03-05, 012602Z
Issue Volume 18, Issue 1, Published in June 2018, 7 articles, success at 20-03-05, 012638Z
Issue Volume 18, Issue 1, Published in March 2018, 7 articles, success at 20-03-05, 012723Z
Issue Volume 17, Issue 3, Published in December 2017, 7 articles, success at 20-03-05, 012803Z
Issue Volume 17, Issue 2, Published in October 2017, 10 articles, success at 20-03-05, 012853Z
Issue Volume 17, Issue 1, Published in March 2017, 7 articles, success at 20-03-05, 012946Z
Issue Volume 16, Issue 2, Published in October 2016, 9 articles, success at 20-03-05, 013044Z
Issue Volume 16, Issue 1, Published in March 2016, 6 articles, success at 20-03-05, 013129Z
Issue Volume 15, Issue 3, Published in October 2015, 9 articles, success at 20-03-05, 013229Z
Issue Volume 
