# NLP of SEG Geophysics journal

In this first notebook I use some basic web scraping packages to extract information from the digital library of the Geophysics journal of the Society of Exploration Geophysicists (SEG).

In the following notebook, this information will be used to obtain more or less useful (and interesting) statistics.

First of all, let's import some useful packages:

In [1]:
import os
import glob 
from datetime import datetime,date

import requests
from bs4 import BeautifulSoup
import re
import nltk
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS  # stop_words submodule is private/removed in recent scikit-learn
import pickle
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display

from utils import *


# Figures inline and set visualization style
%matplotlib inline
sns.set()
sns.set_style("whitegrid")

## Web scraping

Let's start by choosing where the data downloaded from the SEG website will be saved:

In [2]:
pathSEG='./data/'

Get the web links of all the available volumes and issues of the Geophysics journal:

In [3]:
url='http://library.seg.org/loi/gpysa7' # List of volumes

# Make the request 
r = requests.get(url)

# Extract HTML from Response object and print
html = r.text
#print html

# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")

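A first interesting fact: the next block shows how many Geophysics issues are available online today. Each link points to a single issue of a volume, so the number printed below counts issues rather than volumes.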

In [4]:
# Create tokenizer to find weblinks for all volumes of Geophysics
tokenizer = RegexpTokenizer('"((http)s?://library.seg.org/toc/gpysa7/[0-9].*?)"')
volumes = tokenizer.tokenize(html)

# Remove first volume as it contains articles that have just been accepted.
volumes = volumes[1:] 

print('Number of Geophysics Volumes: %d ' % len(volumes))
#print volumes

Number of Geophysics Volumes: 543 
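
A side note on the pattern above: as far as I know, `RegexpTokenizer.tokenize` relies on `re.findall`, which returns one tuple per match when the pattern holds more than one capturing group. This would explain why each entry of `volumes` is later indexed with `[0]`. A minimal sketch with plain `re` (Python 3) on a made-up HTML fragment, not the real SEG page:

```python
import re

# Hypothetical HTML fragment standing in for the real list-of-issues page.
sample_html = '''
<a href="https://library.seg.org/toc/gpysa7/82/6">Issue 6</a>
<a href="https://library.seg.org/toc/gpysa7/82/5">Issue 5</a>
<a href="https://library.seg.org/about">About</a>
'''

# Two capturing groups: findall returns tuples of (full_url, 'http').
with_groups = re.findall(
    r'"((http)s?://library\.seg\.org/toc/gpysa7/[0-9].*?)"', sample_html)

# One capturing group: findall returns plain strings.
plain = re.findall(
    r'"(https?://library\.seg\.org/toc/gpysa7/[0-9][^"]*)"', sample_html)

print(with_groups[0])
print(plain)
```

Either form works; the single-group pattern simply avoids the extra `[0]` when the links are used later.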


Let's start by finding categories in a single issue

In [9]:
volume = 'https://library.seg.org/toc/gpysa7/82/2'

r    = requests.get(volume)
html = r.text
cat  = find_categories(r.text)

Editor's corner: 1
Geophysics Letters: 2
Case Histories: 3
Anisotropy: 4
Borehole Geophysics: 8
Electrical and Electromagnetic Methods: 6
Gravity Exploration Methods: 2
Magnetic Resonance Sounding: 1
Passive Seismic Methods: 1
Reservoir Geophysics: 2
Seismic Interferometry : 1
Seismic Inversion: 6
Seismic Migration: 9
Seismic Modeling and Wave Propagation: 3
Seismic Velocity/Statics: 3
Signal Processing: 5
Errata: 3


I now take one article and learn how to extract useful information:

- Title
- Authors
- Keywords
- Abstract
- Publication history
- Affiliations/Countries
- Number of citations

In [8]:
r = requests.get('https://library.seg.org/doi/abs/10.1190/geo2017-0166.1')

html = r.text
soup = BeautifulSoup(html, "html5lib")
info = soup.findAll('meta')
#print info

# authors
author = filter(lambda x: 'dc.Creator' in str(x), info)
#print author
author  = map(lambda x: str(x).split('"')[1].decode('utf8'), author)
print('Authors:',author)

# keywords
keywords = filter(lambda x: 'dc.Subject' in str(x), info)
#print keywords
keywords = map(lambda x: str(x).split('"')[1].decode('utf8'), keywords)
#print keywords
keywords = map(lambda x: str(x).split(';'), keywords)[0]
print('Keywords:',keywords)


# abstract
abstract = filter(lambda x: 'dc.Description' in str(x), info)
#print abstract
abstract = map(lambda x: str(x).split('"')[1].decode('utf8'), abstract)[0][8:]
print('Abstract:',abstract)


# publication history
info = soup.findAll(text=re.compile("Received:|Accepted:|Published:"))
print info
received, accepted, published = get_pubhistory(info)
print received, accepted, published

# countries
info = soup.findAll('span')
country = filter(lambda x: 'country' in str(x), info)
country = map(lambda x: str(x).split('>')[1].split('<')[0].decode('utf8'), country)
print('Country:',country)


# affiliations (reuse the <span> tags collected for the countries)
affiliation = filter(lambda x: 'class="institution"' in str(x), info)
affiliation = map(lambda x: str(x).split('>')[1].split('<')[0].decode('utf8'), affiliation)
print('Affiliation:',affiliation)


# citations
info = soup.findAll('div', { "class" : "citedByEntry" })
ncitations = len(info)
print('Ncitations:',ncitations)


('Authors:', [u'Joost van der Neut', u'Matteo Ravasi', u'Yi Liu', u'Ivan Vasconcelos'])


IndexError: list index out of range
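
The `IndexError` above most likely comes from the keywords step: if the page carries no `dc.Subject` meta tag, the filtered list is empty and `[0]` fails. A guarded variant can be sketched as follows, using only the standard library (Python 3). The sample HTML, the `name`/`content` attribute layout, and the names `MetaCollector` and `first_or_default` are assumptions made for illustration, not the SEG page or the helpers in `utils`:

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the content of <meta name=... content=...> tags, grouped by name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attrs = dict(attrs)
        name, content = attrs.get('name'), attrs.get('content')
        if name and content is not None:
            self.meta.setdefault(name, []).append(content)


def first_or_default(values, default='-'):
    """Return the first item of a list, or a placeholder when the list is empty."""
    return values[0] if values else default


# Hypothetical page with two authors and no dc.Subject tag.
sample_html = '''
<head>
<meta name="dc.Creator" content="Joost van der Neut">
<meta name="dc.Creator" content="Matteo Ravasi">
</head>
'''

parser = MetaCollector()
parser.feed(sample_html)

authors = parser.meta.get('dc.Creator', [])
subject = first_or_default(parser.meta.get('dc.Subject', []))
keywords = [k.strip() for k in subject.split(';')]

print(authors)
print(keywords)
```

The `'-'` placeholder mirrors the one used for missing keywords and abstracts in the scraping loop.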

Let's now select a volume and collect the information for all papers in each of its issues. The results are stored in .csv tables and pickles, to be used later on in our statistical analysis.
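
Before running the full loop, the on-disk layout it aims for (one folder per volume and issue, holding a .csv table plus pickles) can be sketched on its own. This is a minimal illustration with standard-library modules only (Python 3); the temporary directory stands in for `pathSEG`, and the single record is copied from the volume 82 output, so nothing here touches the real data folder:

```python
import csv
import os
import pickle
import tempfile

# Stand-in for pathSEG; a temporary directory keeps the sketch side-effect free.
base = tempfile.mkdtemp()
issue_url = 'https://library.seg.org/toc/gpysa7/82/6'

# The last two URL components are the volume and the issue.
vol, issue = issue_url.split('/')[-2:]
folder = os.path.join(base, vol, issue)
os.makedirs(folder, exist_ok=True)  # no error if the folder already exists

# One illustrative record and its authors.
rows = [{'Title': 'A new parameterization for acoustic orthorhombic media',
         'Volume': vol, 'Issue': issue}]
authors = ['Shibo Xu', 'Alexey Stovas']

# Table of per-article information.
csv_path = os.path.join(folder, 'df_SEG.csv')
with open(csv_path, 'w', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames=['Title', 'Volume', 'Issue'])
    writer.writeheader()
    writer.writerows(rows)

# Flat list of authors, pickled for the statistics notebook.
pkl_path = os.path.join(folder, 'authors_SEG')
with open(pkl_path, 'wb') as fp:
    pickle.dump(authors, fp)

# Reading the pickle back gives the same list.
with open(pkl_path, 'rb') as fp:
    restored = pickle.load(fp)
print(restored)
```

Creating the folder under the same base path that the files are written to avoids a missing-directory error at save time.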

In [7]:
scrapedvolumes = ['82']  # list of volumes to scrape
ndois          = -1      # number of dois to process, if -1 all dois


for scrapedvolume in scrapedvolumes:

    # match the volume as a full path component, so that e.g. '7' does not match 'gpysa7'
    selvolumes = filter(lambda x: ('/gpysa7/%s/' % scrapedvolume) in str(x), [v[0] for v in volumes])
    print ('Selected volumes %s' % selvolumes)

    for ivolume,volume in enumerate(selvolumes):

        print('Volume %s' % volume)

        # Create folder to save useful info
        vol, issue = volume.split('/')[-2:]

        folder='/'.join(volume.split('/')[-2:]) 
        # create the folder under pathSEG, where the files below are written
        if not os.path.exists(pathSEG+folder):
            os.makedirs(pathSEG+folder)

        # Initialize containers
        df_seg    = pd.DataFrame()
        titles    = []
        authors   = []
        countries = []
        affiliations = []
        keywords  = []
        abstracts = []

        # make request
        r = requests.get(volume)
        html = r.text
        #print html

        # find categories for each doi
        categories = find_categories(html)

        # find all dois
        dois = re.findall('"((http)s?://doi.*?)"', html)
        #print dois

        # (disabled) remove first doi as it is ' This issue of Geophysics '
        #dois = dois[1:]
        # keep only as many dois as there are categorized articles
        dois = dois[:len(categories)]

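        # loop over dois and extract info (a negative ndois means all dois;
        # a plain dois[:-1] would silently skip the last one)
        for idoi, doi in enumerate(dois if ndois < 0 else dois[:ndois]):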

            # sleep for some time to avoid being found web scraping ;)
            time_sleep = np.round(np.random.uniform(0,10))
            print('Sleep for %d' % time_sleep)
            time.sleep(time_sleep)
            
            # Make the request 
            #print('DOI %s' % doi[0])
            #r = requests.get(doi[0])
            
            # rearrange doi to work with volumes before 79
            doi = '/'.join(['http://library.seg.org/doi/abs','/'.join(doi[0].split('/')[-2:])])

            print('DOI %s' % doi)
            r = requests.get(doi)

            # Extract HTML from Response object
            html = r.text
            #print html

            # Create a BeautifulSoup object from the HTML
            soup = BeautifulSoup(html, "html5lib")


            # GET USEFUL INFO #
            info    = soup.findAll('meta')
            infopub = soup.findAll(text=re.compile("Received:|Accepted:|Published:"))
            infoaff = soup.findAll('span')


            # Get title
            title = soup.title.string.split('GEOPHYSICS')[0][18:-3]
            print('Title: %s' % title)
            titles.append(title)

            # Get category
            category = categories[idoi]
            print('Category: %s' % category)


            # Get authors
            author    = filter(lambda x: 'dc.Creator' in str(x), info)
            author_df = map(lambda x: str(x).split('"')[1], author)
            author    = map(lambda x: str(x).split('"')[1].decode('utf8'), author)

            print('Authors: %s' % author)
            authors.extend(author)


            # Get keywords
            keyword     = filter(lambda x: 'dc.Subject' in str(x), info)
            if len(keyword)>0:
                keyword_df  = map(lambda x: str(x).split('"')[1], keyword)#.decode('utf8')
                keyword     = map(lambda x: str(x).split('"')[1], keyword)
                keyword     = map(lambda x: str(x).split(';'), keyword)[0]
            else:
                keyword_df='-'
                keyword='-'
            print('Keywords: %s' % keyword)
            keywords.extend(keyword)


            # Get abstracts
            abstract = filter(lambda x: 'dc.Description' in str(x), info)
            if len(abstract)>0:
                abstract = map(lambda x: str(x).split('"')[1].decode('utf8'), abstract)[0][8:]
            else:
                abstract='-'
            #print('Abstract: %s' % abstract)
            # append (not extend): abstract is a single string, extend would add it character by character
            abstracts.append(abstract)


            # Get countries
            country    = filter(lambda x: 'country' in str(x), infoaff)
            country_df = map(lambda x: str(x).split('>')[1].split('<')[0], country)
            country    = map(lambda x: str(x).split('>')[1].split('<')[0].decode('utf8'), country)

            print('Countries: %s' % country)
            countries.extend(country)


            # Get affiliations
            affiliation    = filter(lambda x: 'institution' in str(x), infoaff)
            affiliation_df = map(lambda x: str(x).split('>')[1].split('<')[0], affiliation)
            affiliation    = map(lambda x: str(x).split('>')[1].split('<')[0].decode('utf8'), affiliation)

            print('Affiliations: %s' % affiliation)
            affiliations.extend(affiliation)


            # Get publication history
            pubhistory = get_pubhistory(infopub)
            print('Publication history: %s\n' % str(pubhistory))


            # Get number of citations
            citations = soup.findAll('div', { "class" : "citedByEntry" })
            ncitations = len(citations)
            print('Number of citations: %d\n' % ncitations)


            # check that I am not being banned by website...
            #if len(author)==0:
            #    print('Last DOI %s')
            #    raise Exception('No Authors')

            df_seg = df_seg.append(pd.DataFrame({'Title'         : title.encode('utf8'), 
                                                 'Category'      : category.encode('utf8'),
                                                 'Authors'       : ('; ').join(author_df),
                                                 'Countries'     : ('; ').join(country_df),
                                                 'Affiliations'  : ('; ').join(affiliation_df),
                                                 'Keywords'      : keyword_df[0],
                                                 'Received'      : pd.Timestamp(pubhistory[0]),
                                                 'Accepted'      : pd.Timestamp(pubhistory[1]),
                                                 'Published'     : pd.Timestamp(pubhistory[2]),
                                                 'Volume'        : vol,
                                                 'Issue'         : issue,
                                                 'Ncitations'    : ncitations}, index=[0]), ignore_index=True)


        # save dataframe
        df_seg.to_csv(pathSEG+folder+'/df_SEG.csv')

        # loop through titles and get all words
        words_title = words_from_text(titles)
        #print words_title
        #words_title = [x.encode('utf-8') for x in words_title]

        # loop through abstracts and get all words
        words_abstract = words_from_text(abstracts)
        #print words_abstract

        # Save words and authors into pickles
        with open(pathSEG+folder+'/wordstitle_SEG', 'wb') as fp:
            pickle.dump(words_title, fp)

        with open(pathSEG+folder+'/wordsabstract_SEG', 'wb') as fp:
            pickle.dump(words_abstract, fp)

        with open(pathSEG+folder+'/authors_SEG', 'wb') as fp:
            pickle.dump(authors, fp)

        with open(pathSEG+folder+'/countries_SEG', 'wb') as fp:
            pickle.dump(countries, fp)

        with open(pathSEG+folder+'/affiliations_SEG', 'wb') as fp:
            pickle.dump(affiliations, fp)
   


Selected volumes [u'https://library.seg.org/toc/gpysa7/82/6', u'https://library.seg.org/toc/gpysa7/82/5', u'https://library.seg.org/toc/gpysa7/82/4', u'https://library.seg.org/toc/gpysa7/82/3', u'https://library.seg.org/toc/gpysa7/82/2', u'https://library.seg.org/toc/gpysa7/82/1']
Volume https://library.seg.org/toc/gpysa7/82/6
Editor's corner: 2
Geophysics Letters: 1
Case Histories: 6
Anisotropy: 4
Borehole Geophysics: 4
Electrical and Electromagnetic Methods: 6
Engineering and Environmental Geophysics: 1
Gravity Exploration Methods: 2
Ground-penetrating Radar: 1
Interdisciplinary Studies: 2
Magnetic Exploration Methods: 1
Magnetic Resonance Sounding: 2
Passive Seismic Methods: 2
Reservoir Geophysics: 1
Rock Physics: 5
Seismic Amplitude Interpretation: 1
Seismic Attributes and Pattern Recognition: 2
Seismic Data Acquisition: 4
Seismic Interferometry : 2
Seismic Inversion: 2
Seismic Migration: 7
Seismic Modeling and Wave Propagation: 1
Seismic Velocity/Statics: 2
Signal Processing: 3
Re

DOI http://library.seg.org/doi/abs/10.1190/geo2017-0215.1
Title: A new parameterization for acoustic orthorhombic media
Category: Anisotropy
Authors: [u'Shibo Xu', u'Alexey Stovas']
Keywords: ['traveltime approximation', ' orthorhombic model', ' seismic modeling']
Countries: [u'Norway']
Affiliations: [u'Norwegian University of Science and Technology']
Publication history: (datetime.datetime(2017, 4, 10, 0, 0), datetime.datetime(2017, 8, 11, 0, 0), datetime.datetime(2017, 10, 5, 0, 0))

Number of citations: 0

Sleep for 5
DOI http://library.seg.org/doi/abs/10.1190/geo2016-0545.1
Title: Application of nanoindentation for uncertainty assessment of elastic properties in mudrocks from micro- to well-log scales
Category: Borehole Geophysics
Authors: [u'Clotilde Chen Valdes', u'Zoya Heidari']
Keywords: ['elastic', ' acoustic', ' log analysis']
Countries: [u'USA', u'USA']
Affiliations: [u'Texas A&amp;M University', u'The University of Texas at Austin']
Publication history: (datetime.datetime(2

DOI http://library.seg.org/doi/abs/10.1190/geo2016-0418.1
Title: Mathematical properties and physical meaning of the gravity gradient tensor eigenvalues
Category: Gravity Exploration Methods
Authors: [u'Carlos Cevallos']
Keywords: ['gravity', ' gradient', ' tensor', ' eigenvalues']
Countries: [u'Australia']
Affiliations: [u'Consultant']
Publication history: (datetime.datetime(2016, 8, 4, 0, 0), datetime.datetime(2017, 6, 29, 0, 0), datetime.datetime(2017, 9, 6, 0, 0))

Number of citations: 0

Sleep for 2
DOI http://library.seg.org/doi/abs/10.1190/geo2016-0008.1
Title: Joint acoustic full-waveform inversion of crosshole seismic and ground-penetrating radar data in the frequency domain
Category: Ground-penetrating Radar
Authors: [u'Xuan Feng', u'Qianci Ren', u'Cai Liu', u'Xuebing Zhang']
Keywords: ['joint inversion', ' cross-gradient constraint', ' full-waveform inversion', ' crosshole seismic', ' crosshole GPR', ' truncated Newton method']
Countries: [u'USA', u'China', u'China']
Affilia

KeyboardInterrupt: 
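
A note on the `ndois` convention and the random pause between requests: a Python slice with a stop of -1 leaves out the last element, so treating -1 as "all" needs an explicit check. A minimal sketch (Python 3), where `limit` and `polite_delay` are hypothetical helper names and the DOI strings are placeholders:

```python
import random


def limit(items, n):
    """Return the first n items, or every item when n is negative."""
    items = list(items)
    return items if n < 0 else items[:n]


def polite_delay(max_seconds=10, rng=random):
    """Pick a whole number of seconds between 0 and max_seconds, inclusive."""
    return rng.randint(0, max_seconds)


dois = ['doi-a', 'doi-b', 'doi-c']  # placeholder identifiers

print(dois[:-1])         # plain slice: the last entry is dropped
print(limit(dois, -1))   # helper: every entry is kept
print(limit(dois, 2))    # helper: first two entries only
print(polite_delay(10))  # value to pass to time.sleep between requests
```

Drawing an integer directly with `randint` avoids rounding a float before sleeping; the effect is otherwise the same as in the loop.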