# Download Oracc HTML Files

This notebook downloads the HTML versions of texts edited on ORACC. It needs an input file that lists the P, Q, or X numbers of the texts to be downloaded, and it saves the downloaded pages in the `HTML` directory. The `Scrape Oracc` Notebook may be used to process these files further.
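
As a quick orientation, the mapping from an input entry to a URL and a local filename can be sketched as follows. The two helper names (`oracc_html_url`, `local_filename`) are hypothetical and are not used elsewhere in this notebook. The URL pattern and the replacement of `/` by `_` are taken from the download cell further down.

```python
def oracc_html_url(text_id):
    """Return the URL that serves the HTML version of one text."""
    return 'http://oracc.org/' + text_id + '/html'

def local_filename(text_id, directory='HTML'):
    """Return the local path; slashes in the identifier become underscores."""
    return directory + '/' + text_id.replace('/', '_') + '.html'

example_id = 'rinap/rinap1/Q003421'
print(oracc_html_url(example_id))   # http://oracc.org/rinap/rinap1/Q003421/html
print(local_filename(example_id))   # HTML/rinap_rinap1_Q003421.html
```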

## Setting up the Environment

The first code cell imports the required packages and reports the Python version under which the notebook is running. Python 3.5 is recommended; it is the version with which the notebook was tested.

The function `patch_http_response_read()` wraps `HTTPResponse.read` so that an `IncompleteRead` error, which would otherwise terminate the script, returns the data received so far instead. This function was found in the blog post [Beaver Notes](http://bobrochel.blogspot.com/2010/11/bad-servers-chunked-encoding-and.html) and has been adapted minimally for Python 3.5.

Other solutions for the `IncompleteRead` error have been suggested on blogs. Some of them may be better or faster, but they have not been tried here. The current script has been stress-tested with a list of more than 3,000 P numbers, which took approximately 45 minutes.
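
For illustration only, one such alternative is to catch the error where the page is read and to try again, rather than patching `HTTPResponse.read` globally. The sketch below is untested against ORACC and is not part of this notebook. The function name `read_with_retry` and its parameters are hypothetical.

```python
import http.client
import time

def read_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it returns without raising IncompleteRead.

    fetch is any zero-argument callable that returns the page as bytes.
    If every attempt fails, the partial data of the last attempt is returned.
    """
    partial = b''
    for attempt in range(retries):
        try:
            return fetch()
        except http.client.IncompleteRead as e:
            partial = e.partial          # keep what was received
            if attempt < retries - 1:
                time.sleep(delay)        # wait before the next attempt
    return partial
```

A call might look like `read_with_retry(lambda: urllib.request.urlopen(url).read())`. This approach would replace the patch rather than complement it: once `HTTPResponse.read` is patched, the exception no longer reaches the caller.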

In [1]:
from __future__ import print_function

import urllib.request
import re
import sys
import os
import time
from tqdm import tqdm
import http.client as httplib

PY3 = sys.version_info.major == 3
print("Running under Python version:", sys.version_info[:3])

if not PY3:
    input = raw_input

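def patch_http_response_read(func):
    """Wrap func so that an IncompleteRead returns the partial data instead of raising."""
    def inner(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except httplib.IncompleteRead as e:
            # the connection closed early: keep what was received
            return e.partial

    return inner

# apply the patch to every HTTP response that is read from here on
httplib.HTTPResponse.read = patch_http_response_read(httplib.HTTPResponse.read)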

Running under Python version: (3, 5, 1)


## Input File

The input file must be located in a directory called `Input`, which in turn must be located in the directory from which this Python Notebook is executed. The file should have a .txt extension and must be created with a plain text editor such as TextEdit, Notepad, or Emacs. The file contains a simple list of P, Q, or X numbers, one per line, each preceded by the abbreviation of the ORACC project (and subproject, if any) in which the text is edited. For instance:

    rinap/rinap1/Q003421
    dcclt/Q000039
    cams/gkab/P348623
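
Before a long download run it may be worth checking the list for malformed entries. The following is a minimal sketch, not part of the original notebook. It assumes that project abbreviations consist of lowercase letters and digits and that every identifier is P, Q, or X followed by six digits, as in the examples above. `ID_PATTERN` and `split_valid` are hypothetical names.

```python
import re

# Assumed pattern: one or more lowercase project segments separated by '/',
# followed by P, Q, or X and six digits.
ID_PATTERN = re.compile(r'^[a-z0-9]+(/[a-z0-9]+)*/[PQX][0-9]{6}$')

def split_valid(lines):
    """Sort input lines into well-formed and malformed entries."""
    valid, invalid = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # ignore empty lines
        if ID_PATTERN.match(entry):
            valid.append(entry)
        else:
            invalid.append(entry)
    return valid, invalid

valid, invalid = split_valid(['rinap/rinap1/Q003421', 'dcclt/Q000039', 'Q000039'])
print(valid)    # ['rinap/rinap1/Q003421', 'dcclt/Q000039']
print(invalid)  # ['Q000039']
```

The last entry is rejected because it lacks a project abbreviation.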



In [2]:
inputFile = input("Name of Input List: ")

Name of Input List: test.txt


In [3]:
with open('Input/' + inputFile, mode = 'r') as f:
    textlist = f.read().splitlines()

In [4]:
if not os.path.exists('HTML'):
    os.mkdir('HTML')
for eachtextid in tqdm(textlist):
    time.sleep(.01)
    eachtextid = eachtextid.strip()
    if not eachtextid:
        continue  # skip empty lines in the input file
    url = 'http://oracc.org/' + eachtextid + '/html'
    print('retrieving ' + url)
    try:
        with urllib.request.urlopen(url) as currentFile:
            f = currentFile.read()
    except OSError as e:
        # URLError and HTTPError are subclasses of OSError;
        # one missing page should not stop the whole run
        print(eachtextid + ' not available (' + str(e) + ')')
        continue
    # identify the P, Q, or X number and check that it appears in the page
    textid = eachtextid[-7:]
    if textid.encode('utf-8') in f:
        # replace / by _ in eachtextid in output filename
        filename = 'HTML/' + eachtextid.replace('/', '_') + '.html'
        print('saving ' + 'http://oracc.org/' + eachtextid + ' as ' + filename)
        with open(filename, mode='wb') as writeFile:
            writeFile.write(f)
    else:
        print(eachtextid + ' not available')


  0%|          | 0/6 [00:00<?, ?it/s]

retrieving http://oracc.org/rinap/rinap1/Q003421/html


 17%|█▋        | 1/6 [00:00<00:01,  3.93it/s]

saving http://oracc.org/rinap/rinap1/Q003421 as HTML/rinap_rinap1_Q003421.html
retrieving http://oracc.org/dcclt/Q000039/html


 33%|███▎      | 2/6 [00:00<00:01,  3.84it/s]

saving http://oracc.org/dcclt/Q000039 as HTML/dcclt_Q000039.html
retrieving http://oracc.org/cams/gkab/P348623/html


 50%|█████     | 3/6 [00:00<00:00,  3.96it/s]

saving http://oracc.org/cams/gkab/P348623 as HTML/cams_gkab_P348623.html
retrieving http://oracc.org/saao/saa10/P334751/html


 67%|██████▋   | 4/6 [00:00<00:00,  4.11it/s]

saving http://oracc.org/saao/saa10/P334751 as HTML/saao_saa10_P334751.html
retrieving http://oracc.org/dcclt/Q000043/html


 83%|████████▎ | 5/6 [00:01<00:00,  3.58it/s]

saving http://oracc.org/dcclt/Q000043 as HTML/dcclt_Q000043.html
retrieving http://oracc.org/blms/P274259/html


100%|██████████| 6/6 [00:01<00:00,  3.81it/s]

saving http://oracc.org/blms/P274259 as HTML/blms_P274259.html



