# Introduction

This file will perform a webscrape the mathematics, statistics and computer science articles found on [ArXiv.org](arxiv.org) and the mirror site [xxx.lanl.gov](xxx.lanl.gov).  The program will collect the title, aurthors, subjects and id number in groups of 25.  After getting the information for a group of 25 articles, the program will then read the pdf file using the id number and then store the title, aurthors, subjects, id number and paper in a Mongo database.  For each group of 25 articles, the site will switch between ArXiv and its mirror site to avoid being detected as a bot.


For Mongo run 

``` sudo service mongod start ```

on the AWS Machine and 

```ssh -i .ssh/aws_key.pem -NL 12345:localhost:27017 ubuntu@YOUR-IP-ADDRESS```

on my machine

In [1]:
import requests
from bs4 import BeautifulSoup
import bs4

import numpy as np
import pandas as pd

import re

from io import BytesIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter,TextConverter,XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import pdfminer

from pymongo import MongoClient

import time

In [2]:
packages = (('Requests',requests),('Beautiful Soup',bs4), ('Numpy',np),
               ('Pandas',pd),('Regex',re),('pdfminer',pdfminer))

for package in packages:
    print('{0} version: {1}'.format(package[0],package[1].__version__))
    
!Python -V

Requests version: 2.18.4
Beautiful Soup version: 4.6.0
Numpy version: 1.14.1
Pandas version: 0.22.0
Regex version: 2.2.1
pdfminer version: 20170720
Python 3.6.4 :: Anaconda custom (64-bit)


## Mongo Set Up

In [3]:
client = MongoClient(port=12345) # this is the port set by the SSH tunnel
db = client.research_papers

In [4]:
db.collection_names()

['cs_papers', 'math_papers', 'stat_papers']

In [5]:
collections = {'stat':db.stat_papers, 'cs':db.cs_papers}
# collections = {'cs':db.cs_papers, 'stat':db.stat_papers, 'math':db.math_papers}
topics = collections.keys()

I am going to take the papers from 2017.  For the subjects mathematics, statistics and computer science there are approximatley 34k, 49k and 27k papers, respectively.  I wrote the code to be able to easily expand to more years.

As it turns out, taking the papers for all of 2017 in a short time period was a bit optimistic...I was able to collect some papers for the first few months...

In [6]:
year = np.char.array(['07','08','09','10','11','12','13','14','15','16','17'])
month = np.char.array(['01','02','03','04','05','06','07','08','09','10','11','12'])

years, months = np.meshgrid(year,month)
years = years.flatten()
months = months.flatten()

url_date = years+months
url_date = sorted(url_date)
url_date = url_date[-12:]

The website doesn't produce an error message when skipping past the total number of articles, it just repeats the final n number of articles.  To give the algorithm a stopping point, I will find the total number of articles.

In [7]:
def Get_Num_Articles(soup):
    '''Returns the total number of articles for the given topic.'''
    pages = []

    link_string = re.compile('/list/(.*)')

    for html in soup.find_all('a'):
        html_attrs = html.attrs
        try: 
            if re.match(link_string,html['href']):
                page_range = html.text
                if '-' in page_range:
                    page_range = page_range.split('-')
                    page = page_range[1]
                    page = int(page)
                    pages.append(page)
        except:
            None

    return max(pages)

Collecting all of the ids stored in the database, so the program can quickly check and ignore papers I already downloaded.  This came about because of some issues with storing the data.

In [8]:
all_ids = np.array([])

for database in [db.cs_papers, db.stat_papers, db.math_papers]:
    math_papers = db.cs_papers
    all_papers = []
    current_paper = math_papers.find()

    for pape in current_paper:    
        id_ = pape['_id']
        all_ids = np.append(all_ids, id_)
        
len(all_ids)

1959

# Article Class

(That's right, I made a class.  I'm probably the best programmer ever!)

The metadata on each page has the following information:
1. Title
2. Aurthors
3. Comments
4. Subjects
5. Paper ID
6. Journal Reference

When obtaining the data, the aurthors of a paper are split into multiple lists.  The code will extract the title, subject and comments from the list and the remaining of the list will be the author.  The code will separatly pull the pdf information, if it is availible.  Some articles do not have pdf links, so I will obtain all of the information for each article from the main page and then delete the ones from the list that do not have a pdf file.

In [9]:
class Article:
    def __init__(self, article_info, pdf_url, url_head):
        global all_ids
        
        self._url_head = url_head
        self.article_info = self.get_article_info(article_info)
        self.title = self.get_title()
        self.subject = self.get_subjects()
        self.author = self.get_authors()
        self.article_id = self.get_article_id(pdf_url)
        
        if self.article_id not in all_ids:
            self.abstract = self.get_abstract()
            self.article = self.get_article()
            all_ids = np.append(all_ids, self.article_id)
        else:
            self.abstract = None
            self.article = None
            
        self.mongo = self.mongodb_form()
    
    def get_article_info(self, article_info):
        '''
            Takes the text from the initital article information, convert
            to a list and make all lower case.
        '''
        
        article_info = article_info.text.strip()

        article_info = article_info.lower()

        enter_char = re.compile(r'\n')
        article_info = re.split(enter_char, article_info)

        while '' in article_info:
            article_info.remove('')
        
        return article_info

    def get_title(self):
        '''
            First checks to see if the title is given in the metadata page,
            if the title is not given it will return None and the unaltered
            article information list, otherwise it will remove the title from
            the article information list and return the title. Furthermore, 
            the title is given in the form 'Title: I am a title.' This function 
            will return just the title.
        ''' 
        
        no_title = True
        index = 0

        while no_title and index < len(self.article_info):
            ai = self.article_info[index]
            if 'title' in ai:
                no_title = False
            else:
                index+=1

        if no_title:
            return None

        else:
            title = self.article_info.pop(index)

            return title[7:]

    def get_subjects(self):
        '''
            First checks to see if the subjects are given in the metadata page,
            if the subjects were not given it will return None and the unaltered
            article information list, otherwise it will remove the subjects from
            the article information list and return a list of the subjects.
            Furthermore, the subjects are given in the form Subject: subject 1 (ss.S1);
            subject 2 (ss.S2).  This function will return a list containing all 
            subjects and topics that are either math, cs (computer science) or stat 
            in the form ss.SS.
        ''' 
        
        no_subjects = True
        index = 0

        while no_subjects and index < len(self.article_info):
            ai = self.article_info[index]
            if 'subject' in ai:
                no_subjects = False
            else:
                index+=1

        if no_subjects:
            return None

        else:
            subjects = self.article_info.pop(index)
            subjects = subjects[10:]

            subjects = subjects.split('; ')

            relevant_subjects = []

            for index, sub in enumerate(subjects):
                par_char = re.compile(r'\(.*\)')
                find_par = re.search(par_char,sub)
                find_par = find_par[0]
                find_par = find_par[1:-1]

                find_par_split = find_par.split('.')
                
                if len(find_par_split) == 1:
                    find_par_split = find_par.split('-')
                    topic, subject = find_par_split
                    
                elif len(find_par_split) == 2:
                    topic, subject = find_par_split
                    
                else:
                    print('-'*50 + ' '*5 + find_par + ' '*5 + '-'*50)
                    topic = 'cs'
                
                if topic in topics:
                    relevant_subjects.append(find_par)
            
            return relevant_subjects

    def remove_comments(self):
        '''
            First checks to see if there are comments given in the metadata page,
            if the comments were given it will remove the comments from the 
            article information list.
        ''' 
        
        no_comments = True
        index = 0

        while no_comments and index < len(self.article_info):
            ai = self.article_info[index]
            
            if 'comment' in ai:
                no_comments = False
                
            else:
                index+=1

        if not no_comments:
            self.article_info.pop(index)
            
    def remove_journal_ref(self):
        '''
            First checks to see if there are journal references given in the 
            metadata page, if there are journal references were given it 
            will remove the them from the article information list.
        ''' 
        
        no_journal_ref = True
        index = 0

        while no_journal_ref and index < len(self.article_info):
            ai = self.article_info[index]
            
            if 'journal-ref' in ai:
                no_journal_ref = False
                
            else:
                index+=1

        if not no_journal_ref:
            self.article_info.pop(index)
    

    def get_authors(self):
        '''
            First removes the comments from the metadata page, then
            checks to see if the authors are given in the metadata page,
            if the authors were not given it will return None otherwise it 
            will return a list of the authors. Furthermore, the authors are 
            given in a list in the form ['Author 1, ', 'Author 2, ', 'Author 3'],
            this code will remove the comma and extra space at the end.
        ''' 
        
        self.remove_comments()
        self.remove_journal_ref()
        
        no_authors = True
        index = 0

        while no_authors and index < len(self.article_info):
            ai = self.article_info[index]
            
            if 'author' in ai:
                no_authors = False
                
            else:
                index+=1

        if no_authors:
            return None

        else:
            self.article_info.pop(index)
            authors = []

            comma_space = re.compile(', ')

            for author in self.article_info:
                author = re.sub(comma_space, '', author)
                authors.append(author)

        return authors
        
    def get_article_id(self, pdf_url):
        '''Pulls the article id from the pdf information.'''
        
        pdf_link_string = re.compile('/pdf/(.*)')

        for pu in pdf_url.find_all('a'):
            try:
                if re.match(pdf_link_string,pu['href']):
                    pu = pu['href']

                    # takes the number for the pdf file
                    pu = pu[5:]
                    return pu
            except:
                None
                
        return None
    
    def convert_pdf(self,need_to_convert):
        '''Converts the content of the website from Bytes to text.'''
        
        try:
            manager = PDFResourceManager() 
            codec = 'utf-8'
            caching = True

            output = BytesIO()
            converter = TextConverter(manager, output, codec=codec, laparams=LAParams())

            interpreter = PDFPageInterpreter(manager, converter)   
            infile = BytesIO(need_to_convert)

            for page in PDFPage.get_pages(infile,caching=caching, check_extractable=True):
                interpreter.process_page(page)

            convertedPDF = output.getvalue()  

            infile.close(); converter.close(); output.close()

            return convertedPDF
        except:
            return None
    
    def initial_cleaning(self, text):
        '''
            Cleans the text by fixing the words that are separated by spaces, removing
            punctuation, removing numbers, making all lower-case, removing excess 
            spacing, and (attempting) to start after the last author.
        '''
        
        punc = r"-\n|!|\"|#|$|%|&|\'|\(|\)|\*|\+|,|-|\.|/|:|;|<|=|>|\?|@|\[|\\|\]|\^|_|\`|\{|\||\}|~|[0-9]+"
        things_to_drop = re.compile(punc)
        text = re.sub(things_to_drop, '', text)

        enters = re.compile(r'\n')
        text = re.sub(enters,' ',text)
        text = re.sub('\s+', ' ', text).strip()
        text = text.lower()

        last_author = re.sub(things_to_drop,'',self.author[-1])

        author_search = re.search(last_author, text)
        
        try:
            start = author_search.span()[1]+1
            text = text[start:]
            
        except:
            None
        
        return text
    
    def get_abstract(self):
        '''
            Uses the article id to go to the abstract page, pulls the 
            abtract then uses initial cleaning to clean it.
        '''
        abstract_url = self._url_head + '/abs/' + self.article_id
        
        abs_response = requests.get(abstract_url)
        text = abs_response.text

        soup = BeautifulSoup(text,'lxml')
        
        try:
            abstract = soup.find('blockquote').text
            abstract = abstract[11:]

            return self.initial_cleaning(abstract)
        
        except:
            None
        
        return None
    
    def get_article(self):
        '''
            Uses the article id to get the pdf document.  Once the reader
            reads the pdf document it will send it to convert_pdf which 
            will convert the PDF to text.
        '''

        pdf_url = self._url_head + '/pdf/' + self.article_id
        
        attempts = 0
        convertedPDF = None
        
        while (attempts < 10) & (not convertedPDF):
            pdf_response = requests.get(pdf_url)

            convertedPDF = self.convert_pdf(pdf_response.content)
            time.sleep(10)
            
            attempts +=1
        
        if not convertedPDF:
            return None
            print('bad: ',article.title)
        
        else:
            soup = BeautifulSoup(convertedPDF,'lxml')
            
            return self.initial_cleaning(soup.text)
        
   
    def mongodb_form(self):
        '''
            Converts the data into a dictionary to be stored in
            a Mongo database.
        '''
        mongo = {}
        mongo['title'] = self.title
        mongo['subject'] = self.subject
        mongo['author'] = self.author
        mongo['_id'] = self.article_id
        mongo['abstract'] = self.abstract
        mongo['article'] = self.article
        
        return mongo

# Data Scrape (aka The Most Painful Part)

In [10]:
for date in url_date[2:]:
    print(date)
    
    max_pages = 1500
    skip = 595
    
    while skip < max_pages:

        for topic in topics:
            
            col = collections[topic]

            # alernating between the site and the mirrored site to avoid being
            # banned.
            if skip // 2 == 0:
                url_head = 'https://arxiv.org' 
            else: 
                url_head = 'http://xxx.lanl.gov'
                
            url = url_head + '/list/' + topic + '/' + date +'?skip=' + str(skip) + '&show=25'
            
            response = requests.get(url)
            site_text = response.text

            soup = BeautifulSoup(site_text,'lxml')
            
#             if skip == 0:
#                 max_pages = Get_Num_Articles(soup)

            # All articles' information is sandwiched between dd's
            article_information = soup.find_all('dd')

            # All pdf information is sandwiched between dt's
            pdf_information = soup.find_all('dt')

            for index, pdf in enumerate(pdf_information):
                if 'pdf' in pdf.text:

                    a_info = article_information[index]
                    
                    try:
                        # for running over night
                        article = Article(a_info, pdf, url_head)

                    except:
                        None
                        
                    if article.article:
                        col.insert_one(article.mongo)
                                
            print('Topic: {0}, Skipping: {1}'.format(topic,skip))

        skip += 25     
            


1703
mongoed: a short note on almost sure convergence of bayes factors in the general  set-up
mongoed: an algorithm for removing sensitive information: application to  race-independent recidivism prediction
mongoed: predicting with limited data - increasing the accuracy in vis-nir  diffuse reflectance spectroscopy by smote
mongoed: tuning free orthogonal matching pursuit
mongoed: optimal graphon estimation in cut distance
mongoed: do pay-for-performance incentives lead to a better health outcome?
mongoed: quantile treatment effects in the regression kink design
mongoed: bergm: bayesian exponential random graph models in r
mongoed: one-sided cross-validation for nonsmooth density functions
mongoed: bayesian adaptive bandit-based designs using the gittins index for  multi-armed trials with normally distributed endpoints
mongoed: student-t process quadratures for filtering of non-linear systems with  heavy-tailed noise
mongoed: growing simplified vine copula trees: improving dißmann's alg

AutoReconnect: localhost:12345: [Errno 61] Connection refused