In [6]:
import re

# Initial (mis)steps

I initially tried using `pdfminer.six` to convert the PDF to text. I spent a while on this before realizing that the text I am interested in had been converted incorrectly in several ways, with no discernible pattern. See the [issue](https://github.com/pdfminer/pdfminer.six/issues/656) I posted on GitHub for details. The rest of this notebook describes those first steps.


## First Steps

First, I used `pdfminer.six` to convert the Arpanet Directory into a text file. This produced stretches of nonsense text spread over many lines. I wanted to remove them, and determined that each stretch was bookended by strings of the form

```
'ARPANET\n\nDIRECTORY'
```

and 

```
'\x0cARPANET\nDIRECTORY\n\nCONTENTS'
```

There are four instances of each string. I thought at first that these must correspond to the beginnings of four different directories (i.e., for the years 1978-81). It actually seems that only three years are represented, with some duplicate pages somewhere. I removed the superfluous text with this:

In [139]:
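with open('arpanet_directory_1978-1980.txt', 'r') as f:
    s = f.read()

# Collapse everything from the opening marker through the closing marker
# down to the closing marker alone; DOTALL lets '.' match newlines.
s = re.sub(r'(ARPANET\n\nDIRECTORY).*?(\x0cARPANET\nDIRECTORY\n\nCONTENTS)', r'\2', s, flags=re.DOTALL)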
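
To see how this substitution behaves, here is a minimal sketch on a made-up string. The `toy` text and the `strip_between` helper are illustrative assumptions, not part of the directory or of the code above. The non-greedy `.*?` stops at the nearest closing marker, and `re.DOTALL` lets `.` cross newlines, so each opening/closing pair should collapse to the closing marker alone.

```python
import re

START = 'ARPANET\n\nDIRECTORY'
END = '\x0cARPANET\nDIRECTORY\n\nCONTENTS'

# Same pattern as above, built from the two markers.
PATTERN = re.compile(
    '(' + re.escape(START) + ').*?(' + re.escape(END) + ')',
    flags=re.DOTALL,
)

def strip_between(text):
    """Replace each START...END span with END alone (hypothetical helper)."""
    return PATTERN.sub(r'\2', text)

# Made-up sample: two noisy spans, each followed by some real content.
toy = (
    START + '\nj u n k\n1\n' + END + '\nfirst directory\n'
    + START + '\nm o r e\nj u n k\n' + END + '\nsecond directory\n'
)

cleaned = strip_between(toy)
print(toy.count(START), toy.count(END))  # how many of each marker the sample has
print(repr(cleaned))
```

Counting the markers first, as in the second-to-last line, is also a quick way to check the "four instances of each string" observation on the real text.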

## Split according to year

Since there are three different directories in this one PDF, it might be useful to split the file. The last page of each directory is a FREQUENTLY CALLED NUMBERS page, so I split the text on that page's header.

In [147]:
split = s.split('FREQUENTLY CALLED NUMBERS\n\nName\n\nPhone\n\nName\n\nPhone')
print(len(split))

4
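
The count of 4 is what `str.split` gives when the delimiter occurs three times: n occurrences always yield n + 1 pieces, the last being whatever follows the final delimiter. A tiny sketch, using a stand-in delimiter and invented text rather than the real page header:

```python
# Stand-in for the 'FREQUENTLY CALLED NUMBERS...' page header (illustrative only).
DELIM = '<<LAST PAGE>>'

# Invented text: three "directories", each ending with the delimiter.
toy = 'dir A' + DELIM + 'dir B' + DELIM + 'dir C' + DELIM + 'trailing page'

pieces = toy.split(DELIM)
print(toy.count(DELIM))  # number of delimiter occurrences
print(len(pieces))       # always occurrences + 1
print(pieces[-1])        # text after the final delimiter
```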


In [152]:
# The fourth item in the list is the Stanford University call number page for the overall document; it can be ignored.
# Let's write the first three to files, one per year:
for year, text in zip([1978, 1979, 1980], split):
    with open(f"arpanet_directory_{year}.txt", "w") as file:
        file.write(text)

In [151]:
# and also let's save the split strings as separate variables to make them easier to work with.
[ad1978, ad1979, ad1980, _] = split

A bit more investigation of these files shows that the formatting differs from year to year. For instance, in 1978 the host acronyms are stored in a table labelled "HOST ACRONYMS AND NETWORK LIAISON", in the form shown below:

![image.png](attachment:image.png)

But in 1979 and 1980, acronyms are stored in a table called "NETWORK HOST ACRONYMS". There are two columns per page, with each column subdivided into two columns of the form

```
HOST-ACRONYMS    site-address
```

where HOST-ACRONYMS is a newline-separated list of all hosts associated with a given site address.

![image.png](attachment:image.png)
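
One way to cope with this difference would be to check which table heading a given year's text contains before deciding how to parse it. The sketch below is only illustrative: `acronym_table_heading` is a hypothetical helper, and the sample strings are invented stand-ins for the extracted text, not real output. It collapses whitespace first, on the assumption that a heading may come out of the conversion split across lines.

```python
import re

# Headings quoted in the text above; the labels are informal descriptions.
HEADINGS = {
    'HOST ACRONYMS AND NETWORK LIAISON': '1978-style table',
    'NETWORK HOST ACRONYMS': '1979/1980-style table',
}

def acronym_table_heading(text):
    """Return the first known heading found in text, or None (hypothetical helper)."""
    # Collapse runs of whitespace so a heading split across lines still matches.
    flat = re.sub(r'\s+', ' ', text)
    for heading in HEADINGS:
        if heading in flat:
            return heading
    return None

# Invented samples with headings broken across lines.
sample_1978 = 'HOST ACRONYMS\nAND NETWORK\n\nLIAISON\nsome rows'
sample_1979 = 'NETWORK HOST\nACRONYMS\nsome rows'

for sample in (sample_1978, sample_1979):
    heading = acronym_table_heading(sample)
    print(heading, '->', HEADINGS.get(heading))
```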


It was around this point that I noticed the conversion problems described at the top of this notebook.
