The purpose of this notebook is to take the list of already trimmed URLs and rewrite each one as follows:

From: `http://www.gutenberg.org/ebooks/6130`

To: `http://www.gutenberg.org/ebooks/6130/pg6130.txt`

We need to do this 100 times, so preferably not by hand.

We will start with a subset of the URLs with which to experiment:

In [28]:
import re

one_url = 'http://www.gutenberg.org/ebooks/6130'

urls = '''
http://www.gutenberg.org/ebooks/6130
http://www.gutenberg.org/ebooks/1727
http://www.gutenberg.org/ebooks/22381
'''

A quick search on capturing a group of numbers turned up this [SO post](https://stackoverflow.com/questions/6711567/how-to-use-python-regex-to-replace-using-captured-group/24514054).

In [7]:
# First try based on SO post above:
p = re.compile(r"(?P<number>\d)")
p.sub(r'\g<number>/pg\g<number>.txt', one_url)  # raw string so \g is not treated as a string escape

'http://www.gutenberg.org/ebooks/6/pg6.txt1/pg1.txt3/pg3.txt0/pg0.txt'

According to [Regular-Expressions.info](https://www.regular-expressions.info/replacebackref.html):

> In Python, if you have the regex `(?P<name>group)` then you can use its match in the replacement text with `\g<name>`.

In [16]:
# Simplified to make it easier for me to focus:
url_re = re.sub(r"(\d)", r"\1/pg\1.txt", one_url)
print(url_re)

http://www.gutenberg.org/ebooks/6/pg6.txt1/pg1.txt3/pg3.txt0/pg0.txt


In [26]:
# Well, that's fairly close, but \d matches one digit at a time; \d+ captures the whole number as a single group:

url_re = re.sub(r"(\d+)", r"\1/pg\1.txt", one_url)
print(url_re)

http://www.gutenberg.org/ebooks/6130/pg6130.txt
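
For comparison, the named-group form from the first attempt also works once the pattern is `\d+`. The sketch below additionally anchors the match to the end of the string, which is my own assumption (it leaves any digits elsewhere in a URL alone). The helper name `to_text_url` is hypothetical.

```python
import re

# Match the run of digits at the very end of the URL as one named group.
BOOK_ID = re.compile(r"(?P<number>\d+)$")

def to_text_url(url):
    """Append /pg<ID>.txt to a Gutenberg ebook URL (hypothetical helper)."""
    return BOOK_ID.sub(r"\g<number>/pg\g<number>.txt", url)

print(to_text_url("http://www.gutenberg.org/ebooks/6130"))
# prints http://www.gutenberg.org/ebooks/6130/pg6130.txt
```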


In [48]:
# Now let's loop through a list of URLs, using our sample above:

for item in urls:
    print(item[0:10])

h
t
t
p
:
/
/
w
w
w
.
g
u
t
e
n
b
e
r
g
.
o
r
g
/
e
b
o
o
k
s
/
6
1
3
0


[... output continues in the same way, one character per line, for each remaining URL ...]


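See what happens when you feed a string to a loop and not the list you thought it was? Iterating over a string yields one character at a time, so each `item` is a single character and `item[0:10]` is just that character.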

Let's turn the string of URLs into a list:

In [29]:
print(urls)


http://www.gutenberg.org/ebooks/6130
http://www.gutenberg.org/ebooks/1727
http://www.gutenberg.org/ebooks/22381



In [31]:
url_list = urls.split('\n')
print(url_list)

['', 'http://www.gutenberg.org/ebooks/6130', 'http://www.gutenberg.org/ebooks/1727', 'http://www.gutenberg.org/ebooks/22381', '']
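
The empty strings at either end come from the newlines that open and close the triple-quoted string. They are harmless here, since the regex finds no digits in them, but a sketch that drops them might look like this. It relies on `str.split()` with no argument, which splits on runs of whitespace and discards empty pieces; the names `sample` and `clean_list` are illustrative.

```python
sample = '''
http://www.gutenberg.org/ebooks/6130
http://www.gutenberg.org/ebooks/1727
http://www.gutenberg.org/ebooks/22381
'''

# split() with no argument splits on any whitespace and drops empty strings
clean_list = sample.split()
print(len(clean_list))
# prints 3
```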


In [32]:
# Now let's try that transformation:
for item in url_list:
    url_re = re.sub(r"(\d+)", r"\1/pg\1.txt", item)
    print(url_re)


http://www.gutenberg.org/ebooks/6130/pg6130.txt
http://www.gutenberg.org/ebooks/1727/pg1727.txt
http://www.gutenberg.org/ebooks/22381/pg22381.txt



Okay, now we are ready to open the file with the URLs, transform them, and save them as a new file that we can then feed to **`wget`**.

In [35]:
%pwd

'/Users/jl/Developer/texts/old_folklore_books'

In [38]:
%ls

Regexing_URLs.ipynb  gutenberg_100.txt
[1m[34mgutenberg[m[m/           readme.md


In [46]:
with open("gutenberg_100.txt") as file:
    urls = file.read()
    url_list = urls.split('\n')
    print(url_list[0:3])

['http://www.gutenberg.org/ebooks/6130', 'http://www.gutenberg.org/ebooks/1727', 'http://www.gutenberg.org/ebooks/22381']


In [44]:
g100 = [re.sub(r"(\d+)", r"\1/pg\1.txt", item) for item in url_list]
print(g100[0:3])

['http://www.gutenberg.org/ebooks/6130/pg6130.txt', 'http://www.gutenberg.org/ebooks/1727/pg1727.txt', 'http://www.gutenberg.org/ebooks/22381/pg22381.txt']


In [45]:
with open('g100.txt', 'w') as f:
    for item in g100:
        f.write("%s\n" % item)
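
The read, transform, and write steps above could be gathered into one function, sketched here. The name `convert_url_file` is hypothetical, and skipping blank lines is my own addition rather than something the cells above do.

```python
import re

def convert_url_file(src, dst):
    """Read URLs from src, append /pg<ID>.txt to each, and write them to dst."""
    with open(src) as infile:
        # Strip whitespace and skip blank lines (e.g. from a trailing newline)
        urls = [line.strip() for line in infile if line.strip()]
    converted = [re.sub(r"(\d+)", r"\1/pg\1.txt", url) for url in urls]
    with open(dst, "w") as outfile:
        outfile.write("\n".join(converted) + "\n")
    return converted
```

With the files in this directory, the call would be `convert_url_file("gutenberg_100.txt", "g100.txt")`.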

In [41]:
%ls

Regexing_URLs.ipynb  gutenberg/           readme.md
g100.txt             gutenberg_100.txt


With that, we have our file that we can feed to **`wget`** and so this notebook has done its job. Below is the command I used:
```
wget -w 2 -i ../g100.txt
```

One more edit to report. The URLs I created looked like this:
```
http://www.gutenberg.org/ebooks/6130/pg6130.txt
```
But they needed to be this:
```
http://www.gutenberg.org/cache/epub/6130/pg6130.txt
```
I made the change, swapping `ebooks` for `cache/epub`, in a text editor.
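
The same swap could have been done in the notebook rather than in an editor. A minimal sketch using `str.replace` on a two-item sample (the variable names `sample_urls` and `fixed` are illustrative):

```python
sample_urls = [
    "http://www.gutenberg.org/ebooks/6130/pg6130.txt",
    "http://www.gutenberg.org/ebooks/1727/pg1727.txt",
]

# Replace only the first occurrence of the path segment in each URL
fixed = [url.replace("/ebooks/", "/cache/epub/", 1) for url in sample_urls]
print(fixed[0])
# prints http://www.gutenberg.org/cache/epub/6130/pg6130.txt
```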

With that edit made, I ran the **`wget`** command above, and received the following response after it completed:
```
FINISHED --2019-01-14 10:56:35--
Total wall clock time: 3m 45s
Downloaded: 71 files, 31M in 21s (1.46 MB/s)
```
So it looks like 29 files turned up as 404s. *Sigh*.
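
To see which books failed, one could compare the expected file names against what actually landed in the download directory. This is only a sketch: it assumes `wget` saved each file under the last path component of its URL (for example `pg6130.txt`), and `missing_files` is a hypothetical helper.

```python
import os

def missing_files(url_file, download_dir):
    """Return the URLs in url_file whose file name is absent from download_dir."""
    have = set(os.listdir(download_dir))
    with open(url_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    # The expected file name is the last path component of each URL
    return [u for u in urls if u.rsplit("/", 1)[-1] not in have]
```

The returned list could then be inspected by hand or written to a new file for a second `wget` pass.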