# Opening and writing files

To start, we'll just try opening, reading, and playing with some `.txt` files.


In [15]:
from glob import glob

glob('../../../../../DATA/gb/*.txt')

['../../../../../DATA/gb/pg768.txt',
 '../../../../../DATA/gb/pg26184.txt',
 '../../../../../DATA/gb/pg1513.txt',
 '../../../../../DATA/gb/pg1342.txt',
 '../../../../../DATA/gb/pg2641.txt',
 '../../../../../DATA/gb/pg2701.txt',
 '../../../../../DATA/gb/pg145.txt',
 '../../../../../DATA/gb/pg84.txt',
 '../../../../../DATA/gb/pg11.txt',
 '../../../../../DATA/gb/pg37106.txt']

In [23]:
# we probably want to see just the file names, so do a for loop
for filename in glob('../../../../../DATA/gb/*.txt'):
    print(filename.split('/')[-1])

# or you can do this without a for loop, with just a list comprehension:
print([x.split('/')[-1] for x in glob('../../../../../DATA/gb/*.txt')])

pg768.txt
pg26184.txt
pg1513.txt
pg1342.txt
pg2641.txt
pg2701.txt
pg145.txt
pg84.txt
pg11.txt
pg37106.txt
['pg768.txt', 'pg26184.txt', 'pg1513.txt', 'pg1342.txt', 'pg2641.txt', 'pg2701.txt', 'pg145.txt', 'pg84.txt', 'pg11.txt', 'pg37106.txt']
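
Splitting on `'/'` works for these paths, but it assumes forward slashes. As a small sketch of two alternatives from the standard library, `os.path.basename` and `pathlib.Path.name` both return the last path component. The `paths` list here is a hypothetical stand-in for the `glob()` results, so the cell runs without the data folder.

```python
import os.path
from pathlib import Path

# Hypothetical stand-in for the glob() results above
paths = ['../../../../../DATA/gb/pg768.txt', '../../../../../DATA/gb/pg84.txt']

# Option 1: os.path.basename returns the final path component
names_os = [os.path.basename(p) for p in paths]

# Option 2: pathlib gives the same result via the .name attribute
names_pathlib = [Path(p).name for p in paths]

print(names_os)
print(names_pathlib)
```

Either option can replace `x.split('/')[-1]` in the list comprehension.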


In [25]:
# try opening one of them:
with open('../../../../../DATA/gb/pg768.txt') as f:
    raw_text = f.read()

print(raw_text[:200])

The Project Gutenberg eBook of Wuthering Heights
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
wh
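
The cell above relies on the platform's default text encoding. A minimal sketch of writing and then reading a file with an explicit `encoding="utf-8"` is below. The sample text, the temporary folder, and the file name `sample.txt` are all assumptions made for illustration; none of them are part of the Gutenberg data.

```python
import os
import tempfile

# Hypothetical sample text, not one of the Gutenberg files
sample = "The Project Gutenberg eBook of Wuthering Heights\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")

    # "w" opens the file for writing (and creates it if needed)
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # the default mode is "r", i.e. reading
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

print(round_trip[:30])
```

Passing the encoding explicitly should make the read behave the same way on different machines.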


#### Hm, yes, that works
But we still don't know what the others are.
#### Well, we could open each file to check, but why not try extracting each title?

In [35]:
# for that, we need regex!
import re

# this is the pattern we need
title_pattern = re.compile(r"The Project Gutenberg eBook of\s+(.*?)(?:,|\n)", re.IGNORECASE)

# re.compile(pattern, flags) prepares a regex object from your pattern so you can reuse it efficiently.
# then we can reuse the title_pattern, instead of writing the pattern each time

### Defining the title pattern (regex)
A regular expression (regex) is a pattern for matching text, like a blueprint for searching or extracting strings.

This line **does not search the text yet**.  
It defines a *pattern* that we will later apply to many Gutenberg files.

Think of it as writing down a rule before using it.

```
The Project Gutenberg eBook of\s+(.*?)(?:,|\n)
```

**Naive assumption:**  
Gutenberg titles usually appear after  
`"The Project Gutenberg eBook of"`.

**Pattern breakdown (left → right):**

- `The Project Gutenberg eBook of`  
  fixed header phrase

- `\s+`  
  one or more whitespace characters

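- `(.*?)`  
  capture the title  
  (`.` any character except a newline, `*` zero or more times, `?` makes the match lazy, so it stops at the first boundary it can)

- `(?:,|\n)`  
  stop at a comma or a newline  
  (`?:` = non-capturing group: it marks the boundary but is not part of the captured title)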

**Goal:**  
Extract a plausible title from noisy headers at scale — not perfect, but practical.
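
To see the breakdown above in action, here is a small sketch that applies the pattern to two header lines. Both strings are hypothetical, modelled on the Gutenberg headers rather than read from the files.

```python
import re

title_pattern = re.compile(r"The Project Gutenberg eBook of\s+(.*?)(?:,|\n)", re.IGNORECASE)

# Hypothetical header lines, modelled on the Gutenberg files
plain_header = "The Project Gutenberg eBook of Wuthering Heights\n"
comma_header = "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n"

# group(1) is the text captured by (.*?)
plain_title = title_pattern.search(plain_header).group(1)
comma_title = title_pattern.search(comma_header).group(1)

print(plain_title)  # stops at the newline
print(comma_title)  # stops at the first comma
```

The lazy `(.*?)` takes as little text as possible, so whichever boundary comes first (comma or newline) ends the title.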


In [36]:
# can we print the titles for all our unknown files?
tmp = {}

for filename in glob('../../../../../DATA/gb/*.txt'):
    with open(filename) as f:
        raw_text = f.read()
    
    match = title_pattern.search(raw_text)
    if match:
        title = match.group(1).strip()
    else:
        title = None  # or "UNKNOWN"

    tmp[filename.split('/')[-1]] = title

# let's print it!
tmp

{'pg768.txt': 'Wuthering Heights',
 'pg26184.txt': 'Simple Sabotage Field Manual',
 'pg1513.txt': 'Romeo and Juliet',
 'pg1342.txt': 'Pride and Prejudice',
 'pg2641.txt': 'A Room with a View',
 'pg2701.txt': 'Moby Dick; Or',
 'pg145.txt': 'Middlemarch',
 'pg84.txt': 'Frankenstein; Or',
 'pg11.txt': "Alice's Adventures in Wonderland",
 'pg37106.txt': 'Little Women; Or'}

# Awesome.

Well, we're missing some stuff: the pattern stops at the first comma, so some titles are cut off. Can you adjust the pattern to get the full titles, e.g., "Little Women; Or ..."?


In [43]:
# adjust ONLY the pattern:
title_pattern = re.compile(r"The Project Gutenberg eBook of\s+(.*?)(?:,|\n)", re.IGNORECASE)

# then run this loop
tmp = {}

for filename in glob('../../../../../DATA/gb/*.txt'):
    with open(filename) as f:
        raw_text = f.read()
    
    match = title_pattern.search(raw_text)
    if match:
        title = match.group(1)
    else:
        title = None  # or "UNKNOWN"

    tmp[filename.split('/')[-1]] = title

# let's print it!
tmp

{'pg768.txt': 'Wuthering Heights',
 'pg26184.txt': 'Simple Sabotage Field Manual',
 'pg1513.txt': 'Romeo and Juliet',
 'pg1342.txt': 'Pride and Prejudice',
 'pg2641.txt': 'A Room with a View',
 'pg2701.txt': 'Moby Dick; Or',
 'pg145.txt': 'Middlemarch',
 'pg84.txt': 'Frankenstein; Or',
 'pg11.txt': "Alice's Adventures in Wonderland",
 'pg37106.txt': 'Little Women; Or'}
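
The output is unchanged because the pattern in this cell is still the original one. One possible adjustment, offered as a sketch rather than the only answer, is to drop the comma from the boundary and stop at the end of the line. The name `full_title_pattern` and the header string are hypothetical, and this assumes the whole title sits on one line.

```python
import re

# One possible adjustment (an assumption): stop only at the line break,
# and let \s* swallow any trailing spaces before it
full_title_pattern = re.compile(r"The Project Gutenberg eBook of\s+(.*?)\s*\n", re.IGNORECASE)

# Hypothetical header, modelled on pg2701.txt
header = "The Project Gutenberg eBook of Moby Dick; Or, The Whale\n"

full_title = full_title_pattern.search(header).group(1)
print(full_title)
```

A title that wraps onto a second line would still be cut off at the first line break, so this remains a practical rule, not a perfect one.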

## Another way
Another way to do this is to get the text directly from Gutenberg, using `requests`.


In [1]:
import requests

# the Gutenberg page for the book we want (ID 345):
# https://www.gutenberg.org/ebooks/345


In [None]:


# Gutendex serves Project Gutenberg metadata as JSON
r = requests.get("https://gutendex.com/books/345")
data = r.json()  # parse the JSON response into a Python dict

data

{'id': 345,
 'title': 'Dracula',
 'authors': [{'name': 'Stoker, Bram', 'birth_year': 1847, 'death_year': 1912}],
 'summaries': ['"Dracula" by Bram Stoker is a Gothic horror novel published in 1897. Told through letters, diary entries, and newspaper articles, the story follows solicitor Jonathan Harker\'s terrifying encounter with Count Dracula in Transylvania. When the vampire Count travels to England and begins preying on victims in Whitby, a small group led by Professor Abraham Van Helsing must hunt him down. This seminal work of Gothic fiction has become the centrepiece of vampire literature, profoundly shaping the popular conception of vampires for generations. (This is an automatically generated summary.)'],
 'editors': [],
 'translators': [],
 'subjects': ['Dracula, Count (Fictitious character) -- Fiction',
  'Epistolary fiction',
  'Gothic fiction',
  'Horror tales',
  'Transylvania (Romania) -- Fiction',
  'Vampires -- Fiction',
  'Whitby (England) -- Fiction'],
 'bookshelves':

In [None]:
text_url = data["formats"]["text/plain; charset=utf-8"]
text = requests.get(text_url).text
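
The lookup above depends on one exact key, including the charset. As a hedged sketch, a small helper can instead pick the first key that starts with `text/plain`. The function `pick_plain_text_url` and the `sample_formats` dict (with placeholder URLs) are hypothetical; the dict is only assumed to be shaped like `data["formats"]`, mapping MIME types to URLs.

```python
def pick_plain_text_url(formats):
    """Return the first URL whose MIME type starts with 'text/plain', else None."""
    for mime_type, url in formats.items():
        if mime_type.startswith("text/plain"):
            return url
    return None


# Hypothetical dict with placeholder URLs, shaped like data["formats"]
sample_formats = {
    "text/html": "https://example.org/book.html",
    "text/plain; charset=us-ascii": "https://example.org/book.txt",
}

print(pick_plain_text_url(sample_formats))
```

In the notebook this could be used as `text_url = pick_plain_text_url(data["formats"])`, followed by a check that the result is not `None` before calling `requests.get`.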