# Do over

Last time, we tried to use pdfminer to mine the [Arpanet Directory](https://www.google.com/books/edition/ARPANET_Directory/AHo-AQAAIAAJ?hl=en&gbpv=1&dq=arpanet+directory&printsec=frontcover) we found on Google Books. We learned some valuable things, such as that the pdf we are working with actually contains _three_ years' worth of directories, starting with 1978. But we ran into [issues](https://github.com/pdfminer/pdfminer.six/issues/656) with out-of-place text while using pdfminer. So instead I tried the simple expedient of `ctrl + a` (select all, then copy), and it worked like a charm.

I also decided it would be helpful to subdivide the full pdf even further, extracting only the HOST ACRONYMS AND NETWORKING LIAISONS table from 1978. The pdf for this table is found [here](/files/pdfs/arpanet-directory_host-acronyms-1978.pdf). The text version is [here](/files/text/hosts-1978.txt).

In [37]:
import re
import pandas as pd

In [43]:
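# Raw string, so the backslashes reach the regex engine instead of being read as (invalid) string escapes.
phone_number_regex = r'(\( [0-9]{3} \) [0-9]{3}-[0-9]{4})'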
column_names = ['acronym', 'host_address', 'type', 'sponsor', 'liason_name', 'liason_email', 'physical_address']

We start by reading the file into a string and processing it:

1. Remove the first lines (up to and including the column names, which we will not use).
2. Replace the headers of each page with empty strings. There are two different formats of page headers to deal with.
3. Split the string, using the phone numbers as a separator, but keeping the separators. You can do this by enclosing the separator in a capture group.
4. Zip the odd and even entries in this list of strings into a list of tuples, corresponding to the entries.

And that'll do. We will leave the phone numbers as separate strings inside the tuple.
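
Steps 3 and 4 can be sketched on a small sample. This is only an illustration: the sample text is two entries copied from the extracted table, the names `PHONE_NUMBER_REGEX`, `sample`, and `split_entries` are mine, and the `strip()` call is an extra tidy-up that is not part of the steps above.

```python
import re

# The phone number pattern, wrapped in a capture group so that
# re.split keeps the matched phone numbers in its result.
PHONE_NUMBER_REGEX = r'(\( [0-9]{3} \) [0-9]{3}-[0-9]{4})'

# Two entries copied from the extracted 1978 table.
sample = (
    "ACCAT - TIP 2/35 TIP ARPA Brennan , Jack ( STEPHENSON@ISI )\n"
    "Naval Ocean Systems Center\n"
    "Code 722\n"
    "217 Catalina Blvd. San Diego , California 92152\n"
    "( 714 ) 225-2871\n"
    "AFWL - TIP 2/48 TIP AFSC Maull , Roy ( MAULL@BBN - TENEX )\n"
    "Air Force Weapons Lab./ADPT Kirtland Air Force Base ,\n"
    "New Mexico 87117\n"
    "( 505 ) 264-2581\n"
)


def split_entries(text):
    """Return a list of (entry_text, phone_number) tuples."""
    # With a capture group, the result alternates:
    # entry text, phone, entry text, phone, ..., trailing remainder.
    parts = re.split(PHONE_NUMBER_REGEX, text)
    # Even indices hold entry text, odd indices hold phone numbers.
    # zip stops at the shorter list, which drops the trailing remainder.
    return [(body.strip(), phone) for body, phone in zip(parts[::2], parts[1::2])]


entries = split_entries(sample)
for body, phone in entries:
    print(body.splitlines()[0], '|', phone)
```

One thing to watch for: an entry with no phone number, such as `[ ALMSA - TIP]` in the extracted text, has no separator after it, so this approach folds it into the text of the entry that follows.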

In [73]:
with open('text/hosts-1978.txt') as hosts:
    data = ''.join(hosts.readlines()[5:])
    data = data.replace('HOST ACRONYMS\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )', '')
    print(data)
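# This code works to split the entries (left commented out for now).
# maxsplit and flags are passed by keyword; passing them positionally is deprecated in newer Python versions.
#     entries = re.split(phone_number_regex, data, maxsplit=0, flags=re.DOTALL)
#     entries = zip(entries[::2], entries[1::2])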


ACCAT - TIP 2/35 TIP ARPA Brennan , Jack ( STEPHENSON@ISI )
Naval Ocean Systems Center
Code 722
217 Catalina Blvd. San Diego , California 92152
( 714 ) 225-2871
AFWL 0/48 SERVER , AFSC
Limited
Havens , Martin ( AFWL@I4 - TENEX )
Air Force Weapons Laboratory
Kirtland Air Force Base Albuquerque , New Mexico 87117 ( 505 ) 264-0319
AFWL - TIP 2/48 TIP AFSC Maull , Roy ( MAULL@BBN - TENEX )
Air Force Weapons Lab./ADPT Kirtland Air Force Base ,
New Mexico 87117
( 505 ) 264-2581
[ ALMSA - TIP] 2/61 TIP ARMY Nelson , Steve ( JAMES @ BBNB )
Commander Army Communications Command
Attn : Steve Nelson , CCNC - STL - S
St. Louis , Missouri 63188
AMES - 11 3/16 USER ARPA Hart , James P. ( HART@AMES - 67 )
NASA Ames Research Center
Network Graphics Group Mail Stop 233-9 Moffett Field , California 94035
( 415 ) 965-6629
AMES - 67 0/16 SERVER ARPA Hathaway , Wayne ( HATHAWAY@AMES - 67 )
NASA Ames Research Center
Computation Division
Mail Stop 233-9 Moffett Field , California 94035 ( 415 ) 965-6033
AMES 

In [74]:
r'''ARPANET DIRECTORY HOST ACRONYMS
NIC 46099 Dec. 1978
ACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS
( Dec )'''

'ARPANET DIRECTORY HOST ACRONYMS\nNIC 46099 Dec. 1978\nACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n( Dec )'
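
Step 2 mentions two page header formats. A minimal sketch of removing both, assuming the string in the cell above is the second format and that both headers appear in the text exactly as written here (I have not checked every page). The names `LONG_HEADER`, `SHORT_HEADER`, and `strip_headers` are hypothetical, and the sample is stitched together purely for illustration.

```python
# The two header formats, copied from the cells above.
LONG_HEADER = (
    "ARPANET DIRECTORY HOST ACRONYMS\n"
    "NIC 46099 Dec. 1978\n"
    "ACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n"
    "( Dec )"
)
SHORT_HEADER = (
    "HOST ACRONYMS\n"
    "ACRONYM ADDR . TYPE SPONSOR LIAISON and SITE ADDRESS\n"
    "( Dec )"
)


def strip_headers(text, headers=(LONG_HEADER, SHORT_HEADER)):
    """Replace every occurrence of each page header with an empty string."""
    for header in headers:
        text = text.replace(header, '')
    return text


# Illustrative sample: real table lines with a header of each format in between.
sample = (
    "( 714 ) 225-2871\n"
    + LONG_HEADER + "\n"
    + "AFWL 0/48 SERVER , AFSC\n"
    + SHORT_HEADER + "\n"
    + "AMES - 11 3/16 USER ARPA Hart , James P. ( HART@AMES - 67 )\n"
)

cleaned = strip_headers(sample)
print(cleaned)
```

Exact string replacement is the simplest option, but it silently skips any header that the copy-paste rendered slightly differently, so it is worth searching the cleaned text for a leftover `ACRONYM ADDR` line afterwards.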