# Working with Data & Files

## Interfacing with the OS

[tutorial](https://realpython.com/working-with-files-in-python/#getting-a-directory-listing)  
[os doc](https://docs.python.org/3/library/os.html)  
[Pathlib doc](https://docs.python.org/3/library/pathlib.html)  

In [None]:
import os
import pathlib

In [None]:
os.listdir()

In [None]:
p = pathlib.Path(".")
list(p.iterdir())

In [None]:
sub_p = p / "subdirectory" # nice handling of '/' to create new paths
print(sub_p)
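
A few more `pathlib` features, as a minimal sketch. It runs inside a temporary directory, and the names `subdirectory` and `notes.txt` are only placeholders chosen for the illustration.

```python
import pathlib
import tempfile

# work in a throwaway directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)

    # the "/" operator joins path components, whatever the OS
    notes = root / "subdirectory" / "notes.txt"

    # create the missing parent directory, then write a small text file
    notes.parent.mkdir(parents=True, exist_ok=True)
    notes.write_text("hello", encoding="utf-8")

    # handy attributes of a Path object
    name, stem, suffix = notes.name, notes.stem, notes.suffix
    print(name, stem, suffix)

    # existence checks, and a recursive search by pattern
    is_there = notes.exists()
    found = sorted(path.name for path in root.rglob("*.txt"))
    print(is_there, found)
```

`os.path` offers similar functions on plain strings; `pathlib` groups them as methods and attributes of a single object.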

## Plain text

[`open` docs](https://docs.python.org/3/library/functions.html#open)
- first arg: file name;
- second arg: the mode. "r": read (the default), "w": write (overwrites an existing file), "a": append; add a "b" to it for binary, e.g. "rb" or "wb"
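
The modes can be tried out safely in a temporary directory. This is only a sketch: the file name `modes_demo.txt` is made up for the example.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "modes_demo.txt"  # hypothetical file name

    # "w" creates the file, or empties it if it already exists
    with open(path, "w", encoding="utf-8") as o:
        o.write("first line\n")

    # "a" appends to the end instead of overwriting
    with open(path, "a", encoding="utf-8") as o:
        o.write("second line\n")

    # "r" decodes the content and returns text (str)
    with open(path, "r", encoding="utf-8") as i:
        as_text = i.read()

    # "rb" returns the raw bytes, without any decoding
    with open(path, "rb") as i:
        as_bytes = i.read()

print(type(as_text).__name__, type(as_bytes).__name__)
```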

I recommend using the [`with` syntax](https://docs.python.org/3/reference/compound_stmts.html#index-16): it creates a context within which the file is open, and automatically closes the file (and cleans things up) when the block ends, even if an error occurs inside it, so you don't need to think about it.

In [None]:
with open("data/linux.txt", "r") as i:
    data_read = i.read()

data_read[:200]
# print(data_read)
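
To make the automatic closing visible, here is a small sketch that checks the file's `closed` attribute inside and outside the block, next to a roughly equivalent `try`/`finally` version. The file `demo.txt` is a throwaway created just for the example.

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "demo.txt"  # hypothetical file
    path.write_text("some text\n", encoding="utf-8")

    # with the `with` statement
    with open(path, "r", encoding="utf-8") as i:
        inside = i.closed   # False: the file is open inside the block
        text = i.read()
    outside = i.closed      # True: closed as soon as the block ends

    # roughly equivalent manual version
    j = open(path, "r", encoding="utf-8")
    try:
        text_manual = j.read()
    finally:
        j.close()           # runs even if read() raises an error

print(inside, outside, text == text_manual)  # False True True
```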

In [None]:
data_read.split("\n")[:5] # no newlines

In [None]:
with open("data/linux.txt", "r") as i:
    data_readlines = i.readlines()

data_readlines[:5] # each line keeps its trailing "\n"
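
The ways of getting lines differ in how they treat line endings. A short comparison, using `io.StringIO` as a stand-in for an open text file (the sample text is invented):

```python
import io

sample = "first\nsecond\nthird\n"

# str.split("\n") removes the newlines, but leaves an empty string
# at the end when the text finishes with a newline
via_split = sample.split("\n")        # ['first', 'second', 'third', '']

# str.splitlines() removes the newlines without the empty last item
via_splitlines = sample.splitlines()  # ['first', 'second', 'third']

# readlines() keeps the newline at the end of each line
via_readlines = io.StringIO(sample).readlines()  # ['first\n', 'second\n', 'third\n']

# iterating over the file object yields one line at a time,
# which avoids loading a big file all at once
via_iteration = [line.rstrip("\n") for line in io.StringIO(sample)]
```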

## JSON

[JSON (JavaScript Object Notation)](https://www.json.org/json-en.html), nice reference there!  
[docs](https://docs.python.org/3/library/json.html)

In [None]:
import json

In [None]:
d = {
    "opening": {
        "totems": ["night", "moon", "fountain"],
        "tools": ["spoon", "megaphone", "pencil sharpener", "plastic skull"]
    },
    "die-throws": [3,5,4,4,6,1,0],
    "eyes-closed": False,
}

# json.dumps ("dump string") converts a Python object to a JSON-formatted string
d_json = json.dumps(d)
d_json
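
`json.dumps` takes a few optional arguments that make the output easier to read. A minimal sketch, with a made-up `record` dictionary:

```python
import json

record = {"eyes-closed": False, "score": None, "throws": [3, 5, 4]}

# default output: one line; False becomes false, None becomes null
compact = json.dumps(record)
print(compact)  # {"eyes-closed": false, "score": null, "throws": [3, 5, 4]}

# indent adds line breaks and indentation, sort_keys orders the keys
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)

# by default non-ASCII characters are escaped;
# ensure_ascii=False keeps them as they are
escaped = json.dumps({"name": "café"})
accented = json.dumps({"name": "café"}, ensure_ascii=False)
print(escaped, accented)
```

The same arguments work with `json.dump` when writing to a file.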

In [None]:
# you can then write it to a file if you wish
with open("data/performance.json", "w") as o:
    o.write(d_json)

In [None]:
# you can achieve the same result in one step with json.dump, which writes to the file directly
with open("data/performance.json", "w") as o:
    json.dump(d, o)

In [None]:
with open("data/performance.json", "r") as i:
    d_reloaded = json.load(i)
d_reloaded
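
One thing to keep in mind: a JSON round trip does not always give back exactly the same Python object, because JSON has fewer types. A small sketch with invented values:

```python
import json

original = {1: "one", "pair": (1, 2)}

# json.loads is the string counterpart of json.load
restored = json.loads(json.dumps(original))
print(restored)  # {'1': 'one', 'pair': [1, 2]}

# the int key became a string, and the tuple became a list
same = restored == original

# some types, like sets, cannot be converted at all
try:
    json.dumps({"totems": {"night", "moon"}})
    error_message = None
except TypeError as e:
    error_message = str(e)
print(error_message)
```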

## Save binary data

[doc](https://docs.python.org/3/library/pickle.html)

In [None]:
import pickle

In [None]:
l = [10, 5, 8, 7]

with open("data/my_list.pkl", "wb") as o:
    pickle.dump(l, o)

In [None]:
with open("data/my_list.pkl", "rb") as i:
    l_reloaded = pickle.load(i)

l_reloaded
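
Unlike JSON, `pickle` keeps Python-specific types intact. A minimal sketch using `pickle.dumps` / `pickle.loads`, the in-memory counterparts of `dump` / `load` (the `original` dictionary is invented for the example):

```python
import pickle

original = {"pair": (1, 2), "totems": {"night", "moon"}, 3: b"raw"}

# dumps returns bytes instead of writing to a file
blob = pickle.dumps(original)

# loads rebuilds the object from those bytes
# careful: only unpickle data you trust, since loading a pickle
# can execute arbitrary code
restored = pickle.loads(blob)

print(type(blob).__name__, restored == original)
print(type(restored["pair"]).__name__, type(restored["totems"]).__name__)
```

The price to pay is that a pickle file is not human-readable and can only be read from Python.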

## Scraping

[BeautifulSoup doc](https://beautiful-soup-4.readthedocs.io/en/latest/)  
[urllib3 doc](https://urllib3.readthedocs.io/en/stable/)

In [None]:
import os
import pathlib
import urllib3
from bs4 import BeautifulSoup

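# ast.literal_eval safely parses a string that contains a Python literal (here a dict)
# https://stackoverflow.com/a/988251
import ast

def get_largest_res(data):
    """Return the value stored under the largest numeric key of a dict given as a string."""
    data = ast.literal_eval(data)
    # compare the keys as integers: as strings, "800" would rank above "1600"
    return data[str(max(int(k) for k in data))]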

works_dir = pathlib.Path("wolf-rehfeldt")
img_dir = works_dir / "images"

# create the output directory and its images subdirectory
works_dir.mkdir(exist_ok=True)
img_dir.mkdir(exist_ok=True)

# url of the site
url = "https://www.richardsaltoun.com/viewing-room/7-ruth-wolf-rehfeldt-letters/"

# set up urllib3 PoolManager
http = urllib3.PoolManager()

# fetch the page content
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')
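
To see what `get_largest_res` does without fetching anything, it can be tried on a hand-written string. The `sample` value and its `example.com` URLs are assumptions that mimic what a `data-responsive-src` attribute might contain; the real attribute on the site may look different.

```python
import ast

def get_largest_res(data):
    data = ast.literal_eval(data)
    return data[str(max([int(k) for k in data.keys()]))]

# hypothetical attribute value: a dict written as a string,
# mapping image widths to URLs
sample = (
    "{'400': 'https://example.com/img_400.jpg', "
    "'1600': 'https://example.com/img_1600.jpg', "
    "'800': 'https://example.com/img_800.jpg'}"
)

largest = get_largest_res(sample)
print(largest)  # https://example.com/img_1600.jpg

# why the keys are converted to int first:
# compared as strings, "800" ranks above "1600"
widths = list(ast.literal_eval(sample).keys())
print(max(widths), max(int(w) for w in widths))  # 800 1600
```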

In [None]:
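# write the html into a file for examination
# (explicit utf-8, so non-ASCII characters in the page are written the same way on every system)
with (works_dir / "soup.html").open("w", encoding="utf-8") as o:
    o.write(str(soup))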

In [None]:
# after checking the html manually, found the class of the container with all the images
artworks_items = soup.select(".panel_type_6")[0]
# print(artworks_items)

# select all `img` inside it
image_divs = artworks_items.select("img")
# print(image_divs)

# select all legends inside it
legends = artworks_items.select(".content")
# print(legends)

for i, (img_div, legend) in enumerate(zip(image_divs, legends)):
    
    # print("-" * 40)
    # print(i)
    # print(img_div)
    # print(legend)
    
    # get the link to the pic in the largest resolution
    artwork_max_res_url = get_largest_res(img_div["data-responsive-src"])
    extension = os.path.splitext(artwork_max_res_url)[1]
    # print(artwork_max_res_url)
    # print(extension)

    # create a filename out of the legend
    image_filename = f"{legend.get_text(strip=True).replace('/','-').replace(', ', ',').replace(' ', '-')}{extension}"
    # print(image_filename)    
    
    # Download the image
    img_response = http.request('GET', artwork_max_res_url)
    if img_response.status == 200:
        print(f" - image downloaded, writing to {img_dir}/{image_filename}")
        with open(img_dir / image_filename, 'wb') as o:
            o.write(img_response.data)
    else:
        print(f" - image {image_filename}, url response status: {img_response.status}")
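
The filename logic of the loop can be isolated and tested without the network. What follows is only a sketch: `legend_to_filename` and `url_extension` are hypothetical helpers, the legend and URL are invented, and removing extra characters and collapsing repeated dashes are additions that the loop above does not do.

```python
import pathlib
import re
from urllib.parse import urlsplit

def url_extension(url):
    """Hypothetical helper: file extension of a URL, ignoring any ?query part."""
    return pathlib.PurePosixPath(urlsplit(url).path).suffix

def legend_to_filename(legend, extension):
    """Hypothetical helper: turn a caption into a file name."""
    # same replacements as in the loop above
    name = legend.strip().replace("/", "-").replace(", ", ",").replace(" ", "-")
    # addition: drop characters that are not allowed in file names on some systems
    name = re.sub(r'[\\:*?"<>|]', "", name)
    # addition: collapse runs of dashes into a single one
    name = re.sub(r"-+", "-", name)
    return f"{name}{extension}"

# invented legend and URL, for illustration only
extension = url_extension("https://example.com/a/img_1600.jpg?x=1")
example = legend_to_filename("Untitled, 1975 / typewriting", extension)
print(example)  # Untitled,1975-typewriting.jpg
```

With the same URL, `os.path.splitext` would return `.jpg?x=1`, because it does not know about query strings.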