# Do over

Last time, we tried to use pdfminer to mine the [Arpanet Directory](https://www.google.com/books/edition/ARPANET_Directory/AHo-AQAAIAAJ?hl=en&gbpv=1&dq=arpanet+directory&printsec=frontcover) we found on Google Books. We learned some valuable things, such as that the pdf we are working with actually contains _three_ years' worth of directories, starting with 1978. But we ran into [issues](https://github.com/pdfminer/pdfminer.six/issues/656) with out-of-place text while using pdfminer. So instead I tried the simple expedient of `ctrl + a` and copy/pasting the text, and it worked like a charm.

I also decided it would be helpful to subdivide the full pdf even further, extracting only the HOST ACRONYMS AND NETWORKING LIAISONS table from 1978. The pdf for this table is found [here](/files/pdfs/arpanet-directory_host-acronyms-1978.pdf). The text version is [here](/files/text/hosts-1978.txt).

In [292]:
import re

## Initial Parsing

We start by reading the file into a string and processing it:

1. Remove the first lines (up to and including the column names, which we will not use) and the last line, which is a page number.
2. Replace the headers of each page with empty strings. There are two different formats of page headers to deal with.
3. Replace page numbers with empty strings.
4. Remove extra whitespace from around all punctuation marks, except for right square brackets. There are two instances of those, and the exterior whitespace is necessary.
5. Remove all extensions from phone numbers. We won't be calling any of these numbers, and the inconsistency will mess up our parsing.
6. Split the string, using the phone numbers as a separator, but keeping the separators.
7. Zip the odd and even entries in this list of strings into a list of tuples, corresponding to the entries.

And that'll do. Let's split off the phone numbers into a separate array for now.

In [293]:
phone_number_regex = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
extension_regex = ' ext [0-9]{1,}'
page_number_regex = r'\n[0-9]{3}\n'
contact_info_regex = r'\w{1,},(?:\w|[\s.]){1,}\(\w{1,}@[a-zA-Z0-9\-]{1,}\)'
whitespace_removal_regex = r'[^\S\r\n]*([^a-zA-Z\d\s\]])[^\S\r\n]*'
page_headers = ('HOST ACRONYMS\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )', 
                'ARPANET DIRECTORY HOST ACRONYMS\nNIC 46099 Dec. 1978\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )\n')
column_names = ['acronym', 'host_address', 'type', 'sponsor', 'liason_name', 'liason_email', 'physical_address']
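To see what step 4 does in practice, here is a small sketch that applies the whitespace pattern to two made-up lines (the host names and the `tighten` helper are invented for illustration, not taken from the text file). It shows why `]` is excluded from the punctuation class: the space after the bracket survives.

```python
import re

# Same pattern as whitespace_removal_regex in the cell above.
whitespace_removal_regex = r'[^\S\r\n]*([^a-zA-Z\d\s\]])[^\S\r\n]*'

def tighten(text):
    """Drop spaces and tabs (but not newlines) around punctuation other than ']'."""
    return re.sub(whitespace_removal_regex, r'\1', text)

# Made-up lines in the style of the directory.
contact = tighten('Doe , Jane ( JDOE @ HOST-A )')
bracketed = tighten('[HOST-TIP] 2 / 61 USER')

print(contact)    # Doe,Jane(JDOE@HOST-A)
print(bracketed)  # [HOST-TIP] 2/61 USER
```

If `]` were treated like the other punctuation marks, the second line would come out as `[HOST-TIP]2/61 USER`, gluing the acronym to the host address.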

In [294]:
with open('text/hosts-1978.txt') as hosts:
    data = ''.join(hosts.readlines()[5:-1])
    for s in page_headers: data = data.replace(s, '')                  # remove page headers
    data = re.sub(page_number_regex, '', data)                         # remove page numbers
    data = re.sub(whitespace_removal_regex, r'\1', data)               # remove extra whitespace
    data = re.sub(extension_regex, '', data)                           # remove extensions
    phone_numbers = re.findall(phone_number_regex, data)               # save phone numbers
    entries = re.split(phone_number_regex, data)                       # split into entries
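The cell above keeps the phone numbers with `re.findall` and splits the text with `re.split`. For reference, the "split but keep the separators, then zip" idea from the list could look roughly like the sketch below. It relies on the fact that `re.split` returns the separators too when the pattern is wrapped in a capturing group. The sample text is made up.

```python
import re

# Phone number pattern wrapped in a capturing group so re.split keeps the matches.
phone_number_capture = r'(\([0-9]{3}\)[0-9]{3}-[0-9]{4})'

# Made-up text in the shape the cleaned-up file has at this point.
sample = ('\nHOST-A 1/1 USER ARPA\nDoe,Jane(JDOE@HOST-A)\n(555)555-0100'
          '\nHOST-B 2/2 SERVER ARPA\nRoe,Rick(RROE@HOST-B)\n(555)555-0199\n')

parts = re.split(phone_number_capture, sample)
# parts alternates: entry text, phone number, entry text, phone number, trailing text

# Pair each entry text (even positions) with the phone number that follows it (odd positions).
# zip stops at the shorter sequence, so the trailing text is dropped.
pairs = list(zip(parts[0::2], parts[1::2]))

for text, phone in pairs:
    print(repr(text), phone)
```

Keeping two parallel lists, as the notebook does, makes it easier to patch up entries that have no phone number at all, which is probably the more practical choice here.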

## Inconsistencies

There are still a number of inconsistencies in our data. Here is a list.

1. Issues with square brackets.
2. Multiple phone numbers (two instances)
3. Multiple extensions (one instance)
4. Ellipses for TYPE and SPONSOR (two instances)
5. Interfering strings 

Here are some helper functions to allow us to find entries more easily.

In [295]:
def startswith(s, t):
    return s.startswith(t)

def is_in(s, t):
    return t in s

def find_entry(entries, substr, method):
    # Return the index of the first entry for which method(entry, substr) is true, or -1.
    for i, entry in enumerate(entries):
        if method(entry, substr):
            return i
    return -1
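A quick usage sketch for these helpers, on a made-up list of entries (the helpers are restated in an equivalent form so the block runs on its own). One thing worth keeping in mind: the "not found" value `-1` is also a valid Python index, so `entries[-1]` would silently pick the last entry rather than raise an error.

```python
def startswith(s, t):
    return s.startswith(t)

def is_in(s, t):
    return t in s

def find_entry(entries, substr, method):
    # Index of the first matching entry, or -1 if nothing matches.
    for i, entry in enumerate(entries):
        if method(entry, substr):
            return i
    return -1

# Made-up entries, only for illustration.
sample_entries = ['\nHOST-A 1/1 USER ARPA\n',
                  '\nHOST-B 2/2 SERVER ARPA\nUp intermittently\n']

first = find_entry(sample_entries, '\nHOST-B', startswith)
second = find_entry(sample_entries, 'intermittently', is_in)
missing = find_entry(sample_entries, 'HOST-C', is_in)

# Guard before indexing: sample_entries[-1] would be the last entry, not an error.
if missing == -1:
    print('HOST-C not found')
```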

### 1. Lack of phone numbers

There are multiple occasions where an entry has no phone number. So we will need to manually split those entries, and insert corresponding placeholder entries in the `phone_numbers` list.

Some of these can be split on the zip code, but others need to be handled differently.

In [296]:
def split_on_zip(index):
    split = re.split(r'([0-9]{5})', entries[index])
    entries[index] = ''.join(split[:2])
    entries.insert(index + 1, ''.join(split[2:]))
    phone_numbers.insert(index, 'NA')

split_on_zip(find_entry(entries, '\n[ALMSA-TIP]', startswith))
split_on_zip(find_entry(entries, '\nMIT-DEV', startswith))
    
# BRAGG-TIP entry doesn't have a zip code listed
bragg_tip_index = find_entry(entries, '\nBRAGG-TIP', startswith)
split = re.split(r'(Carolina)', entries[bragg_tip_index])
entries[bragg_tip_index] = ''.join(split[:2])
entries.insert(bragg_tip_index + 1, split[2])
phone_numbers.insert(bragg_tip_index, 'NA')
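To make the mechanics of `split_on_zip` easier to follow, here is the same function run on made-up data: two entries fused into one string because the first has no phone number. The host names, addresses and phone number are invented.

```python
import re

# Made-up data: one fused string holding two entries, and the one phone number we did find.
entries = ['\nHOST-A 1/1 USER ARPA\nDoe,Jane(JDOE@HOST-A)\nSpringfield,Ohio 45501'
           '\nHOST-B 2/2 SERVER ARPA\nRoe,Rick(RROE@HOST-B)\nDayton,Ohio 45402\n']
phone_numbers = ['(555)555-0199']

def split_on_zip(index):
    # The capturing group keeps the five-digit runs in the result of re.split.
    split = re.split(r'([0-9]{5})', entries[index])
    entries[index] = ''.join(split[:2])            # text up to and including the first zip code
    entries.insert(index + 1, ''.join(split[2:]))  # everything after it becomes a new entry
    phone_numbers.insert(index, 'NA')              # placeholder for the missing phone number

split_on_zip(0)

print(entries)
print(phone_numbers)
```

Note that the split happens at the first run of five digits, wherever it is. An entry whose street address starts with a five-digit number (such as `11440 Isaac Newton Square` in the CCTC entry) would be cut too early, so this helper is only safe on entries that have been checked by eye.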

### 2. Multiple phone numbers or extensions

There are also instances of entries with either two phone numbers or two extensions. For example:

In [297]:
entries[find_entry(entries, ' or ', startswith)]

' or 274-9151\nDCA DCA Czahor,Raymond(DCACODE535@ISI)\nDefense Communications Agency\nAttn:Code 535,Arpanet Management Branch\nWashington,D.C.20305\n'

The first line in this entry is actually the second phone number from the previous entry. We will just remove all such second phone numbers and extensions, since we won't be calling them anyway. This is not the only format of secondary numbers, though: some are comma-separated.

In [298]:
for i, entry in enumerate(entries):
    entries[i] = re.sub(r'( or |,)([0-9]{4}|[0-9]{3}-[0-9]{4})', '', entry)
  

In [299]:
find_entry(entries, ' or ', startswith)

-1

The same substitution also removes the comma-separated secondary numbers, since the pattern accepts either ` or ` or a comma before the digits.
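Here is the substitution applied in isolation, first to the DCA entry shown earlier and then to a made-up string in the comma-separated form (the `,9152` example and the name `secondary_number_regex` are invented for illustration).

```python
import re

# Same pattern as in the loop above: ' or ' or ',' followed by 4 digits or ddd-dddd.
secondary_number_regex = r'( or |,)([0-9]{4}|[0-9]{3}-[0-9]{4})'

# The DCA entry from above, which begins with a leftover second phone number.
entry = (' or 274-9151\nDCA DCA Czahor,Raymond(DCACODE535@ISI)\n'
         'Defense Communications Agency\n'
         'Attn:Code 535,Arpanet Management Branch\n'
         'Washington,D.C.20305\n')
cleaned = re.sub(secondary_number_regex, '', entry)

# Made-up example of the comma-separated form.
comma_form = re.sub(secondary_number_regex, '', ',9152\nHOST-A 1/1 USER ARPA\n')

print(repr(cleaned))
print(repr(comma_form))
```

The comma branch matches any comma directly followed by four digits, so it is worth spot-checking the entries afterwards in case an address contained that pattern.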

### 3. Ellipses for TYPE and SPONSOR columns

Now take another look at that entry. Note that in the original, there are two sets of ellipses for columns TYPE and SPONSOR. I'm not sure why they disappeared in the copy/pasting, but let's insert `NA` in those places. There are actually two instances of this to correct. 

In [300]:
strings = ['NIC DCA', 'DCA DCA']
for s in strings:
  index = find_entry(entries, s, is_in)
  entries[index] = re.sub(s, f'{s[:3]} NA NA DCA', entries[index])
  print(entries[index])


NDRE 1/41 USER,ARPA
VDH
Lundh,Yngvar G.(YNGVAR@SRI-KA)
Norwegian Defence Research
Establishment
P.O.Box 25
2007 Kjeller
NORWAY
(02)712660
NDRE-GATEWAY 3/41 USER,ARPA
VDH
Lundh,Yngvar G.(YNGVAR@SRI-KA)
Norwegian Defence Research
Establishment
P.O.Box 25
2007 Kjeller
NORWAY
(02)712660
NIC NA NA DCA Feinler,Elizabeth(FEINLER@SRI-KL)
SRI International
Network Information Center
Room J2021 333 Ravenswood Avenue
Menlo Park,California 94025


DCA NA NA DCA Czahor,Raymond(DCACODE535@ISI)
Defense Communications Agency
Attn:Code 535,Arpanet Management Branch
Washington,D.C.20305



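This brings us to yet another issue: European phone numbers! The two NDRE entries printed above are still fused with the NIC entry, because `(02)712660` does not match our US-style phone number regex and so no split happened there.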

In [301]:
entries[index]

'\nDCA NA NA DCA Czahor,Raymond(DCACODE535@ISI)\nDefense Communications Agency\nAttn:Code 535,Arpanet Management Branch\nWashington,D.C.20305\n'

### 4. Interfering strings

There are several entries with interfering strings such as "Up intermittently" inserted below the main line of the entry. Let's remove them.


In [302]:
for entry in entries:
  if 'Sharing' in entry:
    print(entry)


CCTC 0/20 SERVER,CCTC Limited
Sharing port with EDN-UNIX
Corrigan,Michael(corrigan@CCTC)
Defense Communications Agency
Command and Control Technical
Center
11440 Isaac Newton Square
Reston,Virginia 22090

EDN-UNIX 0/20 SERVER,DCEC
Limited
Sharing port with CCTC
Margolis,Abe(DLUGOS@BBNB)
Defense Communications
Engineering Center
Code R820
1860 Wiehle Avenue
Reston,Virginia 22090

Sharing port with ROCHESTER
RAND-RCC 077 SERVER ARPA Wahrman,Mike(mike@RAND-UNIX)
The Rand Corporation
1700 Main Street
Santa Monica,California 90406


Sharing port with RADC-XPER
SAT-VDH 3/63 USER,ARPA
VDH
Bressler,Robert(BRESSLER@BBNE)
Bolt Beranek and Newman Inc.
50 Moulton Street
Cambridge,Massachusetts 02138



In [303]:
interfering_strings = ['\nSharing port with EDN-UNIX\n', 
                       '\nUp intermittently\n',
                       '\nSharing port with CCTC\n',
                       '\nTo be connected to IMP 33\n',  # I'm not sure what happened to the 1/79 from this line
                       '\nSharing port with ROCHESTER\n',
                       '\nSharing port with RADC-XPER\n']

In [304]:
for string in interfering_strings:
  index = find_entry(entries, string, is_in)
  entries[index] = re.sub(string, '\n', entries[index])
  print(entries[index])


CCTC 0/20 SERVER,CCTC Limited
Corrigan,Michael(corrigan@CCTC)
Defense Communications Agency
Command and Control Technical
Center
11440 Isaac Newton Square
Reston,Virginia 22090

DEC-MARLBORO 1/37 USER ARPA
Gartley,Carl(LCAMPBELL@SRI-KL)
Digital Equipment Corporation
DEC System-10 Engineering
200 Forest Street
Marlborough,Massachusetts 01752


EDN-UNIX 0/20 SERVER,DCEC
Limited
Margolis,Abe(DLUGOS@BBNB)
Defense Communications
Engineering Center
Code R820
1860 Wiehle Avenue
Reston,Virginia 22090

FNWC 1/64 USER ARPA
Bradford,Brian E.(FNWC@SRI-KA)
Navy Fleet Numerical Weather
Central
Monterey,California 93940


RAND-RCC 077 SERVER ARPA Wahrman,Mike(mike@RAND-UNIX)
The Rand Corporation
1700 Main Street
Santa Monica,California 90406


SAT-VDH 3/63 USER,ARPA
VDH
Bressler,Robert(BRESSLER@BBNE)
Bolt Beranek and Newman Inc.
50 Moulton Street
Cambridge,Massachusetts 02138



There are a number of issues surrounding how the TYPE column is handled. Most entries are a single word, such as `USER` or `SERVER`. But many are rendered on multiple lines. Examples include: `SERVER,\nLimited`, `USER,VDH\nMagtape,`, `SERVER\nLimited\nVDH`.

This would not be a problem, except for the fact that the subsequent lines are not kept grouped together. Here is an example:

In [305]:
find_entry(entries, '\nMIT-DEV', startswith)

71

The substring `Magtape,` is on a new line _in between the data from the sponsor and liaison columns_. This is a problem.

I am hoping to manage this by matching all the contact info, which seems doable, although there is one entry that is out of format.
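As a rough sketch of that idea (not a finished solution), `contact_info_regex` can serve as an anchor: whatever comes before the match is the acronym, host address and the scattered TYPE/SPONSOR pieces, and whatever comes after is the physical address. The variable names `head`, `contact`, `address`, `name` and `mailbox` are my own, and the entry is the SAT-VDH one printed earlier.

```python
import re

# Same pattern as contact_info_regex defined near the top.
contact_info_regex = r'\w{1,},(?:\w|[\s.]){1,}\(\w{1,}@[a-zA-Z0-9\-]{1,}\)'

# The SAT-VDH entry as printed above, after the interfering string was removed.
entry = ('\nSAT-VDH 3/63 USER,ARPA\nVDH\n'
         'Bressler,Robert(BRESSLER@BBNE)\n'
         'Bolt Beranek and Newman Inc.\n'
         '50 Moulton Street\n'
         'Cambridge,Massachusetts 02138\n')

match = re.search(contact_info_regex, entry)
if match:
    head = entry[:match.start()].split()    # acronym, host address, TYPE/SPONSOR pieces
    contact = match.group()                 # 'Last,First(MAILBOX@HOST)'
    address = entry[match.end():].strip()   # everything after the contact info
    name, _, mailbox = contact.rstrip(')').partition('(')
    print(head)
    print(name, mailbox)
    print(address)
```

The TYPE line that wrapped (`VDH`) ends up in `head` next to the first line, so the remaining work is deciding which tokens in `head` belong to TYPE and which to SPONSOR. Entries whose contact info does not match the pattern would need to be handled by hand.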

TODO

1. Handle extra lines in TYPE column (try using contact info regex)

2. Handle non-US phone numbers
    a. 01-387-7050(UK)
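For the second TODO item, one possible direction is to widen the phone number pattern with extra alternatives. The sketch below is only an assumption based on the two non-US numbers seen so far, `(02)712660` and `01-387-7050(UK)`; other formats may exist in the file, and the names `norway_phone`, `uk_phone` and `any_phone_regex` are invented here. The sample text is made up apart from those two numbers.

```python
import re

us_phone = r'\([0-9]{3}\)[0-9]{3}-[0-9]{4}'
norway_phone = r'\(0[0-9]\)[0-9]{6}'             # assumed from '(02)712660'
uk_phone = r'[0-9]{2}-[0-9]{3}-[0-9]{4}\(UK\)'   # assumed from '01-387-7050(UK)'

# Non-capturing group, so findall returns whole matches and split drops the separators.
any_phone_regex = f'(?:{us_phone}|{norway_phone}|{uk_phone})'

sample = ('\nHOST-A 1/1 USER ARPA\n(555)555-0100'
          '\nHOST-B 2/2 USER ARPA\n(02)712660'
          '\nHOST-C 3/3 USER ARPA\n01-387-7050(UK)\n')

found = re.findall(any_phone_regex, sample)
pieces = re.split(any_phone_regex, sample)

print(found)
print(len(pieces))
```

Swapping a pattern like this in for `phone_number_regex` would let the NDRE entries split on their own phone numbers instead of staying attached to the next entry.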