# Working with Data & Files

In this notebook, we will learn how to handle files and data in Python (loading from disk, saving to disk).

## Interfacing with the OS

[tutorial](https://realpython.com/working-with-files-in-python/#getting-a-directory-listing)  
[os doc](https://docs.python.org/3/library/os.html)  
[Pathlib doc](https://docs.python.org/3/library/pathlib.html)  

To deal with files, directories, etc., there are two main ways: `os` is the older one, and `pathlib` the new one. I'll be showing tricks I use regularly, using both approaches (often they overlap).

In [None]:
import os
import pathlib

In [None]:
# lists the content of the current directory
os.listdir()

In [None]:
# with pathlib, you first create a 'path' object, then you can iterate
p = pathlib.Path(".")
list(p.iterdir())

In [None]:
# nice pathlib handling of '/' to create new paths
# with os, you'd need to do os.path.join(first_path, "second_path")
sub_p = p / "subdirectory"
print(sub_p)
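
As a small sketch of how the two approaches line up, the snippet below builds the same path with `os.path.join` and with the `/` operator, then pulls it apart again. The names `data` and `notes.txt` are made up for illustration, and nothing is created on disk.

```python
import os
import pathlib

# build the same relative path with both approaches
os_path = os.path.join("data", "notes.txt")
pl_path = pathlib.Path("data") / "notes.txt"

# a Path object converts to a plain string with str()
same = str(pl_path) == os_path
print(same)

# pathlib exposes the pieces of a path as attributes
print(pl_path.name)    # notes.txt
print(pl_path.stem)    # notes
print(pl_path.suffix)  # .txt
print(pl_path.parent)  # data

# the os equivalents are functions in os.path
print(os.path.basename(os_path))  # notes.txt
print(os.path.splitext(os_path))  # a (root, extension) pair
```

Which one to use is mostly a matter of taste: `pathlib` tends to read more naturally, while `os.path` works directly on plain strings.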

### Creating / Deleting Directories

[`os.mkdir` doc](https://docs.python.org/3/library/os.html#os.mkdir)  
[`os.makedirs` doc](https://docs.python.org/3/library/os.html#os.makedirs)  
[`os.rmdir` doc](https://docs.python.org/3/library/os.html#os.rmdir)  
[`os.removedirs` doc](https://docs.python.org/3/library/os.html#os.removedirs) 


**NOTE: DELETING WITH PYTHON ISN'T "MOVING TO THE BIN".**  
**When you remove something, it's *gone*. Be careful!**

In [None]:
test_dirs = "test/subtest"

# to create a directory, you can use `os.mkdir` (it creates only one
# level, so not "dir/other-dir/" if "dir" is missing, and it throws an
# error if the directory exists)
# os.mkdir(test_dirs)

# or `os.makedirs`, which also creates nested directories (for it not to
# throw an error if the directory exists, use `exist_ok=True`)
os.makedirs(test_dirs, exist_ok=True)

In [None]:
# similarly, to remove a directory, you can use `os.rmdir`,
# removing only the *last* directory ("test" would remain)
# (the directory must be empty: delete all the files inside first)
# os.rmdir(test_dirs)

# to remove nested directories, you can use `os.removedirs`: it removes
# the last directory, then each parent directory, as long as they are empty
os.removedirs(test_dirs)
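
For comparison, here is a rough sketch of the same steps with `pathlib`. It works inside a temporary directory (via the standard `tempfile` module, not used elsewhere in this notebook) so that nothing in your current folder is touched.

```python
import pathlib
import tempfile

# work inside a throwaway directory, deleted automatically at the end
with tempfile.TemporaryDirectory() as tmp:
    nested = pathlib.Path(tmp) / "test" / "subtest"

    # parents=True creates "test" as well, like os.makedirs
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

    # rmdir only removes an *empty* directory, like os.rmdir
    nested.rmdir()
    removed = not nested.exists()
    parent_remains = nested.parent.is_dir()

print(created, removed, parent_remains)
```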


### Renaming files/directories

In [None]:
# notice the `empty.txt` file (it is empty)
os.listdir()

In [None]:
# rename it (this also works for directories, and they don't need to be empty!)
os.rename("empty.txt", "still-empty.txt")

# has it changed?
os.listdir()

In [None]:
# rename it back
os.rename("still-empty.txt", "empty.txt")

#### Extra: `shutil`


[doc](https://docs.python.org/3/library/shutil.html)  
[RealPython tutorial](https://realpython.com/ref/stdlib/shutil/)

[`shutil.rmtree` doc](https://docs.python.org/3/library/shutil.html#shutil.rmtree)  
[`shutil.move` doc](https://docs.python.org/3/library/shutil.html#shutil.move)  
[`shutil.copy` doc](https://docs.python.org/3/library/shutil.html#shutil.copy)  


One thing you can do with `shutil` is removing directories with files in them (even more dangerous – there is no undoing this):

```python
import shutil

# this will recursively remove everything
shutil.rmtree("directory_with_things_in_it")

# similar to `os.rename`, but it also works when moving to another disk/filesystem
shutil.move("old-file-name", "new-file-name")
```
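
To try these functions safely, here is a small sketch that works entirely inside a temporary directory (created with the standard `tempfile` module). The file and folder names are invented for the example.

```python
import pathlib
import shutil
import tempfile

# create a throwaway directory; we delete it ourselves with rmtree at the end
tmp = pathlib.Path(tempfile.mkdtemp())

src = tmp / "original.txt"
src.write_text("hello", encoding="utf-8")

# copy the file: both the original and the copy exist afterwards
copy_path = tmp / "copy.txt"
shutil.copy(src, copy_path)

# move the copy into a subdirectory
sub = tmp / "archive"
sub.mkdir()
shutil.move(str(copy_path), str(sub / "copy.txt"))

# list everything inside the temporary directory (recursively)
names = sorted(p.name for p in tmp.rglob("*"))
print(names)

# remove the whole tree, files included (no undo!)
shutil.rmtree(tmp)
print(tmp.exists())
```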

## Plain text

Important note! You are probably used to working with text in something like Word, Pages, or some other common word processor. The documents you have there (`.docx`, `.odt`, etc.) are in fact closer to webpages than 'just text'. That means that a lot of formatting exists around the text that you don't see when you edit the file! If you want to see how deep the rabbit hole goes, right click on a `.docx` file, and select "uncompress" instead of "open" or "open with": you will see that those files are actually **zip** files, with many weird files inside them!

If you were to load a `.docx` file into Python, you would also get all that formatting (which is written in text as well), and that's not what we want. Even the simpler `.rtf` ("rich text format") that you can sometimes select when saving a file still adds "unseen formatting" text around the text you write.

Instead, ["plain text"](https://en.wikipedia.org/wiki/Plain_text) is what is contained in this cell (what you copy-paste if you enter this cell to edit it, and "select all" to copy). It is simply a sequence of characters, each of which has a particular binary representation under the hood.

Note also that a notebook like this one **also** hides quite a lot of formatting behind it – but a `.py` (or a `.js`, or in fact a `.txt`) file does not! In those, what we see is what we get, no additional text, and that's what we need to work with.
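
To make the idea of a "binary representation" a bit more concrete, here is a tiny sketch that converts a string to bytes and back. It assumes the UTF-8 encoding, which is the most common choice today (the notebook itself does not specify one).

```python
text = "héllo"

# encoding turns characters into bytes (here using UTF-8)
raw = text.encode("utf-8")
print(raw)        # b'h\xc3\xa9llo'
print(len(text))  # 5 characters
print(len(raw))   # 6 bytes: 'é' takes two bytes in UTF-8

# decoding turns the bytes back into the same text
round_trip = raw.decode("utf-8")
print(round_trip == text)
```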

## Open and load files

[`open` docs](https://docs.python.org/3/library/functions.html#open)
- first arg: file name;
- second arg: the mode, "r" for read, "w" for write (add a "b" for binary, as in "rb" or "wb")

[w<sup>3</sup> tutorial](https://www.w3schools.com/python/python_file_open.asp), [RealPython tutorial](https://realpython.com/python-with-statement/)

`open` is the tool we use both to open existing files and to create new ones! It also works really well with a particular syntax called a [context manager](https://book.pythontips.com/en/latest/context_managers.html) (don't worry about knowing this) that uses the keyword `with`: it creates a context within which the file is open, and automatically closes it (and cleans things up) afterwards, so you don't need to think about manually doing the opening and closing.
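
To see what `with` saves you from, here is a rough comparison with the manual version. It writes to a temporary file (the name `demo.txt` is made up), and it uses the mode "a" (append), which is not covered above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical throwaway file, just for this demonstration
    path = os.path.join(tmp, "demo.txt")

    # manual version: we have to remember to close the file ourselves
    f = open(path, "w", encoding="utf-8")
    try:
        f.write("first line\n")
    finally:
        f.close()

    # `with` version: the file is closed as soon as the indented block ends
    with open(path, "a", encoding="utf-8") as f2:  # "a" appends to the file
        f2.write("second line\n")

    with open(path, "r", encoding="utf-8") as i:
        content = i.read()

# both file objects are closed at this point
print(f.closed, f2.closed)
print(content)
```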

In [None]:
# there's a file in data/, let's open it
# store the file object in a variable called 'i' (for 'input')
with open("data/linux.txt", "r") as i:
    # to actually get the data, you need to invoke `.read()`
    data_read = i.read()

# here's some text, the first 200 chars!
data_read[:200]
# print(data_read)

In [None]:
# now we can handle it the way we're familiar with,
# here splitting it on newlines, taking the first 3 lines
data_read.split("\n")[:3]

In [None]:
# in some cases we know we want to work on lines,
# so instead of the raw `read()` we can invoke `readlines`
with open("data/linux.txt", "r") as i:
    data_readlines = i.readlines()

# beware: with this method the newlines are preserved!
# (each line ends with '\n')
data_readlines[:3]
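
If the trailing newlines get in the way, one possible approach is to loop over the file directly and strip them, or to use `splitlines()` on the full text. The sketch below uses a small temporary file rather than `data/linux.txt`, so that it can run on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # hypothetical sample file with three lines
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as o:
        o.write("first\nsecond\nthird\n")

    # a file object can be looped over directly, one line at a time
    with open(path, "r", encoding="utf-8") as i:
        lines = [line.rstrip("\n") for line in i]

    # splitlines() on the full text gives the same result
    with open(path, "r", encoding="utf-8") as i:
        same_lines = i.read().splitlines()

print(lines)
print(lines == same_lines)
```

Note that `split("\n")` would instead leave an empty string at the end of the list when the text ends with a newline.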

## JSON

[docs](https://docs.python.org/3/library/json.html)  
[w<sup>3</sup> tutorial](https://www.w3schools.com/js/js_json.asp), [Mozilla tutorial](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Scripting/JSON)  
[JSON (JavaScript Object Notation)](https://www.json.org/json-en.html), nice reference there!

`JSON` is a very useful format for saving data in a way that is both 1) similar to programming syntax (`lists`, `dictionaries`, ...) but also 2) remains plain text, so very transparent!

It presents itself with this kind of format (like a dictionary, within which you can have lists or other dictionaries):

```json
{
    "key1": "value",
    "key2": ["1", "2", "3"],
    "key3": {
        "nested-key": "nested-value"
    },
    "last-key": "one-last-thing"
}
```

Important difference with Python/JavaScript: **no trailing commas** are allowed (in Python, you could have a comma, or not, after `"3"` or after `"nested-value"`, or after the last element, "one-last-thing").
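
Here is a quick way to check this for yourself: a minimal sketch that parses two small JSON strings with `json.loads`, one valid and one with a trailing comma.

```python
import json

valid = '{"key2": ["1", "2", "3"]}'
invalid = '{"key2": ["1", "2", "3",]}'  # trailing comma after "3"

parsed = json.loads(valid)
print(parsed)

# the trailing comma makes the parser raise an error
try:
    json.loads(invalid)
    error = None
except json.JSONDecodeError as e:
    error = e
    print("not valid JSON:", e)
```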

In [None]:
import json

In [None]:
# create a complex dictionary/object
d = {
    "opening": {
        "totems": ["night", "moon", "fountain",],
        "tools": ["spoon", "megaphone", "pencil sharpener", "plastic skull",],
    },
    "die-throws": [3,5,4,4,6,1,0],
    "eyes-closed": False,
}

# json.dumps converts a Python object to a JSON-compatible string
# (note that all the pesky trailing commas, allowed in Python, have been removed)
d_json = json.dumps(d)
d_json

In [None]:
# you can then write it to a file if you wish
# using "w" (write) instead of "r" (read)
# (here I use 'o' for "output" as a name for the file object)
with open("data/performance.json", "w") as o:
    o.write(d_json)

**NOTE**: just like with removing files, if you have a file of the same name and open it like this for writing, **ANY PREVIOUS DATA WILL BE LOST, NOT REVERSIBLE**.

In [None]:
# you can achieve the same result with json.dump
with open("data/performance.json", "w") as o:
    json.dump(d, o)

In [None]:
# if you want to load your JSON file, open it like a plain text file,
# then let json.load turn it back into Python objects
with open("data/performance.json", "r") as i:
    d_reloaded = json.load(i)
d_reloaded
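
Two details that can be useful in practice, sketched below with a made-up dictionary: `indent` makes the output easier to read, and a round trip through JSON does not always return exactly the same Python types.

```python
import json

# hypothetical data, with a tuple and a non-string key
d = {"die-throws": (3, 5, 4), "eyes-closed": False, "nothing": None, 1: "one"}

# indent=4 spreads the output over several lines, easier to read in a file
pretty = json.dumps(d, indent=4)
print(pretty)

# a round trip does not always give back the exact same object:
# tuples come back as lists, and non-string keys come back as strings
back = json.loads(pretty)
print(back)
```

Notice also that `False` and `None` are written as `false` and `null` in the JSON text.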

## Save binary data

[doc](https://docs.python.org/3/library/pickle.html)

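Often it can be useful to save things not in a human-readable format, but as raw binary data: this is often more compact, and it can store Python objects that JSON cannot represent (sets, tuples, instances of your own classes, ...). For that, we use the tool `pickle`. Only load pickle files that you trust: loading one can run arbitrary code.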

In [None]:
import pickle

In [None]:
# let's define a list
l = [10, 5, 8, 7]

# to write, we use "wb" for "write binary"
with open("data/my_list.pkl", "wb") as o:
    pickle.dump(l, o)

In [None]:
# to read/load, we use "rb" for "read binary"
with open("data/my_list.pkl", "rb") as i:
    l_reloaded = pickle.load(i)

l_reloaded
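
As a rough illustration of the difference with JSON, the sketch below pickles an object containing a set, using `pickle.dumps` and `pickle.loads` (the in-memory cousins of `dump` and `load`). The data is invented for the example.

```python
import json
import pickle

# a set and a tuple: JSON has no equivalent for a set
obj = {"tags": {"night", "moon"}, "pair": (1, 2)}

# dumps/loads work like dump/load, but with bytes in memory instead of a file
raw = pickle.dumps(obj)
print(type(raw))

# the restored object has the same content and the same types
restored = pickle.loads(raw)
print(restored == obj)
print(type(restored["pair"]))

# the same object cannot be written as JSON
try:
    json.dumps(obj)
    json_ok = True
except TypeError as e:
    json_ok = False
    print("JSON refused:", e)
```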

## Extra: Scraping

[BeautifulSoup doc](https://beautiful-soup-4.readthedocs.io/en/latest/)  
[urllib3 doc](https://urllib3.readthedocs.io/en/stable/)

Very often programmers and hackers use programs to grab data from the Internet and use it in various ways. This is called "scraping" (and a program doing this is a "scraper"). Below is an example of how you would do this for one webpage. This requires two kinds of knowledge: Python, of course, but also how webpages are structured (`html` mostly, but also the other two main components, `css` and `JavaScript` – beyond the scope of this course). Don't worry about this too much – just note that running the following cells should create a folder called `wolf-rehfeldt`, and download a series of images in there for you – beautiful poems that could serve as an inspiration for your own work!

In [None]:
import os
import pathlib
import urllib3
from bs4 import BeautifulSoup

works_dir = pathlib.Path("wolf-rehfeldt")
img_dir = works_dir / "images"

# create directories for the downloaded html and images
works_dir.mkdir(exist_ok=True)
img_dir.mkdir(exist_ok=True)

# url of the site
url = "https://www.richardsaltoun.com/viewing-room/7-ruth-wolf-rehfeldt-letters/"

# set up urllib3 PoolManager
http = urllib3.PoolManager()

# fetch the page content (this is the **whole web page code**, all the html, as plain text)
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')

In [None]:
# write the html into a file for examination
with (works_dir / "soup.html").open("w", encoding="utf-8") as o:
    o.write(str(soup))

In [None]:
# convert string dict to dict
# https://stackoverflow.com/a/988251
import ast

# this is to select the largest image format from the object on the web page
def get_largest_res(data):
    data = ast.literal_eval(data)
    return data[str(max([int(k) for k in data.keys()]))]

# after checking the html manually, found the class of the container with all the images
artworks_items = soup.select(".panel_type_6")[0]
# print(artworks_items)

# select all `img` inside it
image_divs = artworks_items.select("img")
# print(image_divs)

# select all legends inside it
legends = artworks_items.select(".content")
# print(legends)

for i, (img_div, legend) in enumerate(zip(image_divs, legends)):
    
    # print("-" * 40)
    # print(i)
    # print(img_div)
    # print(legend)
    
    # get the link to the pic in the largest resolution
    artwork_max_res_url = get_largest_res(img_div["data-responsive-src"])
    extension = os.path.splitext(artwork_max_res_url)[1]
    # print(artwork_max_res_url)
    # print(extension)

    # create a filename out of the legend
    image_filename = f"{legend.get_text(strip=True).replace('/','-').replace(', ', ',').replace(' ', '-')}{extension}"
    # print(image_filename)    
    
    # Download the image
    img_response = http.request('GET', artwork_max_res_url)
    if img_response.status == 200:
        print(f" - image downloaded, writing to {img_dir}/{image_filename}")
        with open(img_dir / image_filename, 'wb') as o:
            o.write(img_response.data)
    else:
        print(f" - image {image_filename}, url response status: {img_response.status}")
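
To see what the helper `get_largest_res` and the filename clean-up do without going online, here is a small sketch that runs them on made-up inputs. The `sample` string and the legend text are assumptions: the real `data-responsive-src` values on the page may look different.

```python
import ast
import os

def get_largest_res(data):
    # turn the text into a real dictionary, then pick the entry
    # with the largest numeric key
    data = ast.literal_eval(data)
    return data[str(max(int(k) for k in data.keys()))]

# hypothetical attribute value: keys are image widths, values are image urls
sample = "{'400': 'https://example.com/img/small.jpg', '1600': 'https://example.com/img/large.jpg'}"

largest = get_largest_res(sample)
print(largest)

# the file extension is taken from the end of the url
extension = os.path.splitext(largest)[1]
print(extension)

# hypothetical legend text, cleaned up the same way as in the loop above
legend_text = "Untitled, 1975 / typewriting"
filename = legend_text.replace("/", "-").replace(", ", ",").replace(" ", "-") + extension
print(filename)
```

The keys are converted to `int` before taking the maximum because, as text, `"400"` would count as larger than `"1600"`.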