# Do over

Last time, we tried to use pdfminer to mine the [Arpanet Directory](https://www.google.com/books/edition/ARPANET_Directory/AHo-AQAAIAAJ?hl=en&gbpv=1&dq=arpanet+directory&printsec=frontcover) we found on Google Books. We learned some valuable things, such as that the pdf we are working with actually contains _three_ years' worth of directories, starting with 1978. But we ran into [issues](https://github.com/pdfminer/pdfminer.six/issues/656) with out-of-place text while using pdfminer. So instead I tried the simple expedient of `ctrl + a` and it worked like a charm.

I also decided it would be helpful to subdivide the full pdf even further, extracting only the HOST ACRONYMS AND NETWORKING LIAISONS table from 1978. The pdf for this table is found [here](/files/pdfs/arpanet-directory_host-acronyms-1978.pdf). The text version is [here](/files/text/hosts-1978.txt).

In [2]:
import re
import pandas as pd

## Initial Parsing

We start by reading the file into a string and processing it:

1. Remove the first lines (up to and including the column names, which we will not use), and the last line, which is a page number.
2. Replace the headers of each page with empty strings. There are two different formats of page headers to deal with.
3. Replace page numbers with empty strings.
4. Remove extra whitespace from around all punctuation marks -- except for right square brackets. There are two instances of those, and the exterior whitespace is necessary.
5. Remove all extensions from phone numbers. We won't be calling any of these numbers, and the inconsistency will mess up our parsing.
6. Split the string, using the phone numbers as a separator, but keeping the separators.
7. Zip the odd and even entries in this list of strings into a list of tuples, corresponding to the entries.

And that'll do. Let's split off the phone numbers into a separate array for now.
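As a minimal sketch of the last two steps in the list (not the code used in the cells below), `re.split` keeps the separators when the pattern is wrapped in a capture group, and slicing with a step of 2 pairs each entry with the phone number that follows it. The `sample` string and the names `parts` and `pairs` are made up for illustration.

```python
import re

# Same phone-number pattern as in the notebook, wrapped in a capture group
# so that re.split keeps the separators in its output.
phone_number_regex = r'(\([0-9]{3}\)[0-9]{3}-[0-9]{4})'

# Hypothetical sample, loosely modeled on the cleaned text.
sample = 'HOST-A 1/1 USER ARPA\n(555)123-4567\nHOST-B 2/2 SERVER ARPA\n(555)765-4321\n'

# parts alternates: entry, phone, entry, phone, ..., trailing remainder
parts = re.split(phone_number_regex, sample)

# Pair every even-indexed item (entry) with the odd-indexed item after it (phone).
# zip stops at the shorter sequence, so the trailing remainder is dropped.
pairs = list(zip(parts[0::2], parts[1::2]))

for entry, phone in pairs:
    print(repr(entry), phone)
```

The cells below take the other route: `re.findall` collects the phone numbers and `re.split` (without a capture group) collects the entries, which leaves two parallel lists instead of one list of tuples.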

In [160]:
phone_number_regex = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
extension_regex = r' ext [0-9]{1,}'
page_number_regex = r'\n[0-9]{3}\n'
contact_info_regex = r'\w{1,},(?:\w|[\s.]){1,}\(\w{1,}@[a-zA-Z0-9\-]{1,}\)'
whitespace_removal_regex = r'[^\S\r\n]*([^a-zA-Z\d\s\]])[^\S\r\n]*'
page_headers = ('HOST ACRONYMS\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )', 
                'ARPANET DIRECTORY HOST ACRONYMS\nNIC 46099 Dec. 1978\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )\n')
column_names = ['acronym', 'host_address', 'type', 'sponsor', 'liason_name', 'liason_email', 'physical_address']
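To see what `whitespace_removal_regex` does in isolation, here is a small sketch. The `raw` string is an assumption: it is two lines of the cleaned output shown further down, with spaces put back around the punctuation to imitate the text as copied from the pdf.

```python
import re

# Same pattern as in the notebook: optional horizontal whitespace, one
# punctuation character (anything except letters, digits, whitespace and ']'),
# then optional horizontal whitespace. Newlines are never consumed.
whitespace_removal_regex = r'[^\S\r\n]*([^a-zA-Z\d\s\]])[^\S\r\n]*'

# Hypothetical raw lines: cleaned output with spaces re-added around punctuation.
raw = '[ALMSA-TIP] 2/61 TIP ARMY Nelson , Steve ( JAMES @ BBNB )\nSt. Louis , Missouri 63188'

# Keep only the punctuation character itself (group 1).
cleaned = re.sub(whitespace_removal_regex, r'\1', raw)
print(cleaned)
```

Because `]` is excluded from the punctuation class, the space after `[ALMSA-TIP]` survives, which keeps the acronym separated from the host address.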

In [161]:
with open('text/hosts-1978.txt') as hosts:
    data = ''.join(hosts.readlines()[5:-1])
    for s in page_headers: data = data.replace(s, '')                  # remove page headers
    data = re.sub(page_number_regex, '', data, flags=re.DOTALL)        # remove page numbers
    data = re.sub(whitespace_removal_regex, r'\1', data)               # remove extra whitespace
    data = re.sub(extension_regex, '', data)                           # remove extensions
    phone_numbers = re.findall(phone_number_regex, data)               # save phone numbers   
    entries = re.split(phone_number_regex, data, flags=re.DOTALL)      # split on phone numbers

In [163]:
print(entries[find_entry(entries, '\n[ALMSA-TIP]', startswith)])


[ALMSA-TIP] 2/61 TIP ARMY Nelson,Steve(JAMES@BBNB)
Commander Army Communications Command
Attn:Steve Nelson,CCNC-STL-S
St.Louis,Missouri 63188
AMES-11 3/16 USER ARPA Hart,James P.(HART@AMES-67)
NASA Ames Research Center
Network Graphics Group Mail Stop 233-9 Moffett Field,California 94035



## Inconsistencies

There are still a number of inconsistencies in our data. Here is a list.

1. Issues with square brackets.
2. Entry 40 -- multiple phone numbers.
3. Interfering strings.

Here are some helper functions to allow us to find entries more easily.

In [147]:
def startswith(s, t):
    return s.startswith(t)

def is_in(s, t):
    return t in s

def find_entry(entries, substr, method):
    # return the index of the first matching entry, or -1 if there is none
    for index, entry in enumerate(entries):
        if method(entry, substr):
            return index
    return -1
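A quick usage sketch of these helpers. The `sample_entries` list is made up, shaped like the strings that `re.split` produces, and the helpers are repeated so the block runs on its own.

```python
def startswith(s, t):
    return s.startswith(t)

def is_in(s, t):
    return t in s

def find_entry(entries, substr, method):
    # Return the index of the first entry for which method(entry, substr) holds.
    for index, entry in enumerate(entries):
        if method(entry, substr):
            return index
    return -1

# Hypothetical entries; each one starts with a newline, like the real ones.
sample_entries = [
    '\nHOST-A 1/1 USER ARPA Roe,Richard(RR@HOST-A)',
    '\nHOST-B 2/2 SERVER ARPA Doe,Jane(JD@HOST-B)',
]

by_prefix = find_entry(sample_entries, '\nHOST-B', startswith)
by_substring = find_entry(sample_entries, 'RR@HOST-A', is_in)
missing = find_entry(sample_entries, 'HOST-C', is_in)
print(by_prefix, by_substring, missing)
```

Note the leading `\n` in the prefix search: every entry begins with the newline that followed the previous phone number.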

### 1. Lack of phone numbers

There are multiple occasions where an entry has no phone number. So we will need to manually split those entries, and insert corresponding empty entries in the `phone_numbers` list.

In [148]:
find_entry(entries, '\nMIT-DEV', startswith)

69

In [170]:
# splitting entries[3] into two entries
# split = re.split(r'([0-9]{5})', entries[3])
# entries[3] = ''.join(split[:2])
# entries.insert(4, ''.join(split[2:]))

def split_on_zip(index):
    split = re.split(r'([0-9]{5})', entries[index])
    print(split)
    print(len(split))
    entries[index] = ''.join(split[:2])
    entries.insert(index + 1, ''.join(split[2:]))

split_on_zip(find_entry(entries, '\n[ALMSA-TIP]', startswith))
split_on_zip(find_entry(entries, '\nMIT-DEV', startswith))
    
# # splitting entries[25] into two entries
# split = re.split(r'(Carolina)', entries[25])
# entries[25] = ''.join(split[:2])
# entries.insert(26, split[2])

# # insert missing phone numbers 
# phone_numbers.insert(3, '')
# phone_numbers.insert(25, '')

['\nMIT-DEVMULTICS 3/41 SERVER,ARPA\nLimited\nGreenberg,Bernard S.\n(Greenberg@MIT-MULTICS)\nHoneywell Information Systems Cambridge Information Systems Laboratory\n575 Technology Square,3rd Floor\nCambridge,Massachusetts ', '02139', '\nMIT-DMS 1/6 SERVER ARPA Galley,Stuart W.(SWG@MIT-DMS)\nMassachusetts Institute of\nTechnology\nLaboratory for Computer Science Dynamic Modeling System 545 Technology Square Cambridge,Massachusetts ', '02139', '\n']
5
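The split above leaves `phone_numbers` one item short for every entry that was divided. A sketch of doing both steps together follows. It assumes that `phone_numbers[i]` is the number that followed `entries[i]` in the text, and the function name `split_entry_on_zip` and the sample data are hypothetical.

```python
import re

def split_entry_on_zip(entries, phone_numbers, index):
    """Split entries[index] after its first 5-digit ZIP code and insert an
    empty placeholder so that phone_numbers stays aligned with entries."""
    parts = re.split(r'([0-9]{5})', entries[index], maxsplit=1)
    if len(parts) != 3:
        raise ValueError('no ZIP code found in entry %d' % index)
    head, zip_code, tail = parts
    entries[index] = head + zip_code       # first host, which had no phone number
    entries.insert(index + 1, tail)        # second host, which keeps the phone number
    phone_numbers.insert(index, '')        # placeholder for the first host

# Hypothetical data: the first string holds two hosts, the first without a phone.
sample_entries = [
    '\nHOST-A 1/1 USER ARPA\nSomewhere,Missouri 63188\nHOST-B 2/2 USER ARPA\nElsewhere,California 94035\n',
    '\nHOST-C 3/3 SERVER ARPA\nNowhere,Massachusetts 02139\n',
]
sample_phones = ['(555)123-4567', '(555)765-4321']

split_entry_on_zip(sample_entries, sample_phones, 0)
print(sample_entries)
print(sample_phones)
```

Using `maxsplit=1` means only the first ZIP code is used as the cut point, so the ZIP code of the second host stays attached to it.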


In [172]:
print(entries[find_entry(entries, '\nMIT-DEV', startswith) + 1])



MIT-DMS 1/6 SERVER ARPA Galley,Stuart W.(SWG@MIT-DMS)
Massachusetts Institute of
Technology
Laboratory for Computer Science Dynamic Modeling System 545 Technology Square Cambridge,Massachusetts 02139



In [155]:
entries[71]

'\nMIT-DEVMULTICS 3/41 SERVER,ARPA\nLimited\nGreenberg,Bernard S.\n(Greenberg@MIT-MULTICS)\nHoneywell Information Systems Cambridge Information Systems Laboratory\n575 Technology Square,3rd Floor\nCambridge,Massachusetts 02139\nMIT-DMS 1/6 SERVER ARPA Galley,Stuart W.(SWG@MIT-DMS)\nMassachusetts Institute of\nTechnology\nLaboratory for Computer Science Dynamic Modeling System 545 Technology Square Cambridge,Massachusetts 02139\n'

### 2. Multiple phone numbers

Entry `41` is problematic too.

In [109]:
print(entries[41])

 or 274-9151
DCA DCA Czahor,Raymond(DCACODE535@ISI)
Defense Communications Agency
Attn:Code 535,Arpanet Management Branch
Washington,D.C.20305



The first line is an alternate number for the previous entry. We can just remove it, since the phone numbers are non-essential. But there is another problem: two fields are missing from the first remaining line. It reads `DCA DCA Czahor . . .`, but in the actual text it reads `DCA ... ... DCA Czahor . . .`. I will replace the missing fields with `NA`.

In [110]:
adjusted = '\n'.join(entries[41].split('\n')[1:])
adjusted = re.sub(r'DCA DCA', r'DCA NA NA DCA', adjusted)
entries[41] = adjusted

In [111]:
print(entries[41])

DCA NA NA DCA Czahor,Raymond(DCACODE535@ISI)
Defense Communications Agency
Attn:Code 535,Arpanet Management Branch
Washington,D.C.20305



### 3. Interfering strings

Entry `44` has the string `"\nUp intermittently\n"` inserted in the middle of its first line. Let's remove it.


In [112]:
entries[44] = re.sub('\nUp intermittently\n', '', entries[44])
print(entries[44])


DEC-MARLBORO 1/37 USER ARPAGartley,Carl(LCAMPBELL@SRI-KL)
Digital Equipment Corporation
DEC System-10 Engineering
200 Forest Street
Marlborough,Massachusetts 01752



There are also a number of issues with how the TYPE column is handled. Most entries are a single word, such as `USER` or `SERVER`. But many are rendered on multiple lines. Examples include: `SERVER,\nLimited`, `USER,VDH\nMagtape,`, `SERVER\nLimited\nVDH`.

This would not be a problem, except for the fact that the subsequent lines are not kept grouped together. Here is an example:

In [115]:
find_entry(entries, '\nMIT-DEV', startswith)

71

The substring `Magtape,` is on a new line _in between the data from the sponsor and liaison columns_. This is a problem.

I am hoping to manage this by matching all the contact info, which seems doable, although there is one entry that is out of format.
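As a rough sketch of that idea, `contact_info_regex` can be used to cut an entry into the part before the contact info, the contact info itself, and the address after it. The entry below is the MIT-DEVMULTICS text printed earlier; the names `header`, `contact`, and `address` are just for illustration, and this has not been tried on the out-of-format entry.

```python
import re

# Same pattern as defined near the top of the notebook.
contact_info_regex = r'\w{1,},(?:\w|[\s.]){1,}\(\w{1,}@[a-zA-Z0-9\-]{1,}\)'

entry = ('\nMIT-DEVMULTICS 3/41 SERVER,ARPA\nLimited\nGreenberg,Bernard S.\n'
         '(Greenberg@MIT-MULTICS)\nHoneywell Information Systems Cambridge '
         'Information Systems Laboratory\n575 Technology Square,3rd Floor\n'
         'Cambridge,Massachusetts 02139')

match = re.search(contact_info_regex, entry)
if match is None:
    raise ValueError('no contact info found')

header = entry[:match.start()]     # acronym, address, type and sponsor, in some order
contact = match.group()            # liaison name and email
address = entry[match.end():]      # site address

print(header.split())
print(repr(contact))
```

Everything in `header` is then known to belong to the first four columns, so stray pieces of the TYPE column such as `Limited` can no longer be confused with the liaison or the site address.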

TODO

1. Handle extra lines in TYPE column (try using contact info regex)

2. Handle non-US phone numbers
    a. 01-387-7050(UK)
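
For the second item, one possible approach is to add an alternative to the phone number pattern. This is only a sketch: the second alternative is tailored to the single UK number noted above, the name `any_phone_number_regex` is hypothetical, and the sample text is made up.

```python
import re

# US format as in the notebook, or the one UK format seen so far.
any_phone_number_regex = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}|[0-9]{2}-[0-9]{3}-[0-9]{4}\(UK\)'

# Hypothetical cleaned text with one US and one UK number.
sample = 'HOST-A 1/1 USER ARPA\n(555)123-4567\nHOST-B 2/2 USER ARPA\n01-387-7050(UK)\n'

found_numbers = re.findall(any_phone_number_regex, sample)
found_entries = re.split(any_phone_number_regex, sample)
print(found_numbers)
print(found_entries)
```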