# Do over

Last time, we tried to use pdfminer to mine the [Arpanet Directory](https://www.google.com/books/edition/ARPANET_Directory/AHo-AQAAIAAJ?hl=en&gbpv=1&dq=arpanet+directory&printsec=frontcover) we found on Google Books. We learned some valuable things, such as that the pdf we are working with actually contains _three_ years' worth of directories, starting with 1978. But we ran into [issues](https://github.com/pdfminer/pdfminer.six/issues/656) with out-of-place text while using pdfminer. So instead I tried the simple expedient of `ctrl + a`, and it worked like a charm.

I also decided it would be helpful to subdivide the full pdf even further, extracting only the HOST ACRONYMS AND NETWORKING LIAISONS table from 1978. The pdf for this table is found [here](/files/pdfs/arpanet-directory_host-acronyms-1978.pdf). The text version is [here](/files/text/hosts-1978.txt).

In [2]:
import re
import pandas as pd

In [134]:
phone_number_regex = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
page_number_regex = r'\n[0-9]{3}\n'
whitespace_removal_regex = r'[^\S\r\n]*([^a-zA-Z\d\s\]])[^\S\r\n]*' 
page_headers = ('HOST ACRONYMS\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )', 
                'ARPANET DIRECTORY HOST ACRONYMS\nNIC 46099 Dec. 1978\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )\n')
column_names = ['acronym', 'host_address', 'type', 'sponsor', 'liason_name', 'liason_email', 'physical_address']
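To get a feel for what these patterns do, here is a small sketch that runs two of them on a made-up line. The name, host and phone number in `sample` are invented for illustration and do not come from the directory.

```python
import re

phone_number_regex = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
whitespace_removal_regex = r'[^\S\r\n]*([^a-zA-Z\d\s\]])[^\S\r\n]*'

# Invented text that imitates the stray spaces around punctuation
sample = 'SMITH @ BBNA\n( 617 ) 555 - 0100'

# Drop spaces and tabs (but not newlines) on both sides of each punctuation mark
tightened = re.sub(whitespace_removal_regex, r'\1', sample)

# Once the spacing is tightened, the phone number pattern can match
found = re.findall(phone_number_regex, tightened)

# Right square brackets are excluded, so whitespace around them survives
bracketed = re.sub(whitespace_removal_regex, r'\1', 'A [ B ] C')
```

Here `tightened` is `'SMITH@BBNA\n(617)555-0100'`, `found` holds the single phone number, and `bracketed` is `'A[B ] C'`.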

We start by reading the file into a string and processing it:

1. Remove the first lines (up to and including the column names, which we will not use), and the last line which is a page number.
2. Replace the headers of each page with empty strings. There are two different formats of page headers to deal with.
3. Replace page numbers with empty strings.
4. Remove extra whitespace from around all punctuation marks -- except for right square brackets. There are two instances of those, and the exterior whitespace is necessary.
5. Split the string, using the phone numbers as a separator, but keeping the separators. You can do this by enclosing the separator in a capture group.
6. Zip the odd and even entries in this list of strings into a list of tuples, corresponding to the entries.
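Steps 5 and 6 can be sketched on a toy string. The records and phone numbers below are made up, and `phone_pattern` is simply the phone number regex wrapped in a capture group, so treat this as an illustration rather than the code used on the real file.

```python
import re

# The parentheses around the whole pattern form a capture group,
# which makes re.split keep the separators in its result
phone_pattern = r'(\([0-9]{3}\)[0-9]{3}-[0-9]{4})'

# Invented records, each followed by an invented phone number
toy = 'AAA 1 Host A\n(617)555-0100\nBBB 2 Host B\n(415)555-0199\n'

parts = re.split(phone_pattern, toy)

# Even positions hold entry text, odd positions hold the phone numbers;
# zip stops at the shorter slice, dropping the trailing leftover
pairs = list(zip(parts[0::2], parts[1::2]))
```

The catch is that this pairing is only right when every record ends with a phone number. A record without one stays glued to the next record.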

And that'll do. For now, though, let's split the phone numbers off into a separate list instead of zipping them with the entries.

In [135]:
with open('text/hosts-1978.txt') as hosts:
    # skip the first five lines and the trailing page number
    data = ''.join(hosts.readlines()[5:-1])
    for s in page_headers:
        data = data.replace(s, '')    # remove both page header formats
    data = re.sub(page_number_regex, '', data, flags=re.DOTALL)
    data = re.sub(whitespace_removal_regex, r'\1', data)    # remove extra whitespace
    # collect the phone numbers, then split the text wherever one occurs
    phone_numbers = re.findall(phone_number_regex, data)
    entries = re.split(phone_number_regex, data, flags=re.DOTALL)

There are still a number of inconsistencies in our data. For one, there are two occurrences of square brackets that are causing issues. For another, there are multiple entries with no phone number, and because we split on phone numbers, each of those is still joined to the entry that follows it. So we will need to manually split those entries, and insert corresponding empty entries in the `phone_numbers` list.

In [136]:
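# split on five-digit numbers (the zip codes), keeping them in the result
split = re.split(r'([0-9]{5})', entries[3])
# the first entry ends with its zip code; the remainder is the next entry
entries[3] = ''.join(split[:2])
entries.insert(4, ''.join(split[2:]))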

TODO 

- insert missing phone number into list at index 3
- find other missing phone_numbers
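One possible way to tackle both items is sketched below. The helper names `split_entry_at_zip` and `entries_with_several_zips` are hypothetical, the sample data is invented, and the search rests on an assumption: that an entry containing more than one five-digit number is two records joined together. Five-digit numbers that are not zip codes would produce false positives, so the results would still need checking by hand.

```python
import re

def split_entry_at_zip(entries, phone_numbers, index):
    """Split entries[index] after its first five-digit number, in place,
    and insert an empty phone number so both lists stay aligned."""
    pieces = re.split(r'([0-9]{5})', entries[index], maxsplit=1)
    if len(pieces) != 3:
        raise ValueError(f'no five-digit number in entry {index}')
    before, zip_code, after = pieces
    entries[index] = before + zip_code
    entries.insert(index + 1, after)
    phone_numbers.insert(index, '')   # placeholder for the missing number

def entries_with_several_zips(entries):
    """Indices of entries holding more than one five-digit number
    (assumed, not guaranteed, to be joined records)."""
    return [i for i, entry in enumerate(entries)
            if len(re.findall(r'\b[0-9]{5}\b', entry)) > 1]

# Invented data: the first record has no phone number,
# so it is still joined to the second one
entries = ['X1 Menlo Park, CA 94025\nX2 Cambridge, MA 02138\n', '\n']
phone_numbers = ['(617)555-0100']

suspects = entries_with_several_zips(entries)
for i in reversed(suspects):   # go backwards so earlier indices stay valid
    split_entry_at_zip(entries, phone_numbers, i)
```

After the loop, `phone_numbers[i]` once again belongs to `entries[i]`, with an empty string marking the record that had no number.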