## Extracting Metadata From Files: Exploring Different Widely-Used, Text-Based File Formats

Somehow, we have to get our data into files; otherwise, we'd lose it every time we exited Python!  Ideally, the way we store our data will make it easy to read and write, both in our own favorite computational environment and in those of our colleagues, without requiring everyone to develop ultra-complex custom code.

That's where standardized file formats come in.  In this set of exercises, we'll practice **serializing** data into a string that we can write into a text file and **deserializing** text into Python data structures, using three different text file formats: 

  - **JSON** (JavaScript Object Notation)
  - **XML** (Extensible Markup Language)
  - **YAML** (YAML Ain't Markup Language; originally "Yet Another Markup Language")

### File Objects in Python

| **Code** | **Description** |
| :-- | :-- |
| **`f = open('newfile', 'w')`** | Makes a file object that is linked to a newly-opened file called 'newfile' that expects to do text writing. |
| **`f = open('oldfile', 'r')`** | Makes a file object that is linked to an existing file called 'oldfile' that expects to do text reading. **Note**: the 'r' is optional, as it is the default "file mode" for the `open()` function.|
| **`f.write(text)`** | Writes the string in `text` to the text file linked to `f` |
| **`text = f.read()`** | Reads all the text stored in the text file linked to `f` |
| **`f.close()`** | Closes the file that is linked to the file object `f`. |

**Exercises**

**Example**: Write the text, "Hello, World!" into a text file called `hello.txt`, then close the file.

In [1]:
f = open('hello.txt', 'w')
f.write('Hello, World!')
f.close()

Write the text, "Goodbye, everyone." into a text file called `bye.txt`, then close the file.

In [2]:
f = open('bye.txt', 'w')
f.write('Goodbye, everyone.')
f.close()

Try writing the text, "Does this work?" to the file object after it has already been closed.  You should see an error (after all, the file is no longer open for writing!).  What type of error do you get?

In [3]:
f.write('Does this work?')


ValueError: I/O operation on closed file.
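
One common way to avoid writing to a closed file by accident is Python's `with` statement, which closes the file automatically at the end of the block.  The sketch below is only an illustration: the scratch file path is hypothetical and is not part of the exercises.

```python
import os
import tempfile

# Hypothetical scratch file, so the sketch does not touch the exercise files.
path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

# The `with` block closes the file automatically when the block ends.
with open(path, 'w') as f:
    f.write('Does this work?')

print(f.closed)  # the file object still exists, but its file is closed

# Writing after the block has ended raises the same kind of error as above.
try:
    f.write('Too late')
except ValueError as err:
    error_name = type(err).__name__
    print(error_name, err)
```

With this pattern there is no `f.close()` line to forget.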

Read the text from the file `bye.txt` into the string variable `text`.

In [8]:
f = open('bye.txt', 'r')
text = f.read()
f.close()
text

'Goodbye, everyone.'

Write the string 'Emma' into the file `subject001.txt` in the `data/raw` folder:

In [17]:
import os
os.makedirs('./data/raw', exist_ok = True)
f = open('./data/raw/subject001.txt', 'w')
f.write('Emma')

4

Write the number `10` into the file `ten.txt`.

In [34]:
import os
os.makedirs('./data/raw', exist_ok = True)
f = open('./data/raw/ten.txt', 'w')
f.write('10')
f.close()


Read the number from `ten.txt` into the integer variable `data`.

In [37]:
f = open('./data/raw/ten.txt')
data = int(f.read())  # f.read() always returns a string, so convert it to an integer
f.close()
print(data)
type(data)

10


int

**Extra, max 3 minutes on this exercise**: The file created below has 5 numbers written into it.  Read the 5 numbers into a Python list of floats.

In [74]:
f = open('data.txt', 'w')
f.write('[5, 6.2, 3.12, 1.0000, 2]\n')
f.close()

As you can probably see, this is a bit tricky without some extra utility functions to say how to **parse** the text into a data structure.  If we have a **standard file format**, however, then we can use the same parsing functions on any data file in that format, to get the data in and out of that file. 
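
To make the difficulty concrete, here is a minimal hand-written parser for that one line of text.  It is a sketch that assumes the text always looks exactly like the line written above (square brackets around comma-separated numbers); any other layout would need different code, which is exactly the problem a standard format solves.

```python
# The same text that the cell above writes into data.txt.
text = '[5, 6.2, 3.12, 1.0000, 2]\n'

# Remove the trailing newline, then the surrounding square brackets.
stripped = text.strip().strip('[]')

# Split on commas and convert each piece to a float
# (float() tolerates the leading spaces left over from the split).
numbers = [float(item) for item in stripped.split(',')]

print(numbers)  # [5.0, 6.2, 3.12, 1.0, 2.0]
```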

Below, we'll review three different text file formats and get some practice with the functions in libraries used to read and write data into those formats.

---

### The JSON and YAML File Formats

Serialization libraries have functions that convert data structures to and from their file format; these functions usually come in pairs, one for writing and one for reading:
  - `serialize()` and `deserialize()`
  - `unparse()` and `parse()`
  - `dump()` and `load()`

Which pair of terms is used depends on the library.
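
As a small illustration of such a pair, the standard-library `json` module has `dumps()` and `loads()`, which work on strings rather than files.  The `record` dictionary below is a made-up example, not part of the exercises.

```python
import json

# Hypothetical example data.
record = {'name': 'Scratchy', 'trials': [1, 2, 3]}

# Serialize: Python data structure -> JSON-formatted string.
text = json.dumps(record)
print(text)  # {"name": "Scratchy", "trials": [1, 2, 3]}

# Deserialize: JSON-formatted string -> Python data structure.
restored = json.loads(text)
print(restored == record)  # True
```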

Here, we'll try out YAML and JSON, two very popular text-based file formats, to write complex dict- and list-based data structures to files and read them back.  A nice reference sheet showing how the YAML and JSON syntaxes relate to each other is available at https://quickref.me/yaml.html


| Code | Description |
| :-- | :-- |
| **Reading** | |
| **`data = json.load(f)`** | Read JSON data from a file object. |
| **`data = yaml.load(f, Loader=yaml.Loader)`** | Read YAML data from a file object. |
| **Writing** | |
| **`json.dump(data, f, indent=3)`** | Write data into a JSON text file. (`indent=3` is optional; it makes the file easier to read in a text editor.) |
| **`yaml.dump(data, f)`** | Write data into a YAML text file. |



**Exercises**

In [75]:
# %pip install PyYAML

In [76]:
import json
import yaml

**Exercises**: All of the following exercises use the metadata data structure below:

In [77]:
mdata = {
    'metadata': {
        'height': 1080, 
        'width': 1920, 
        'order': 'RGB', 
        'date': '2024-12-24', 
        'subject': {
            'id': 'x134', 
            'name': 'Scratchy', 
            'sources': ['Cartoon', 'The Simpsons Lab, Springfield']
        }, 
        'researchers': ['Itchy', 'Bart', 'Lisa']
    }
} 
mdata

{'metadata': {'height': 1080,
  'width': 1920,
  'order': 'RGB',
  'date': '2024-12-24',
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'researchers': ['Itchy', 'Bart', 'Lisa']}}

**Example**: Write the metadata to a JSON file called `recording1.json`:

In [78]:
f = open('recording1.json', 'w')
json.dump(mdata, f, indent=3)
f.close()

Once the `recording1.json` file is created, try opening it in a text editor to get a sense of how the file is written.

Read the file back into a variable called `data_from_json`.   Was the data read back in correctly?
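
One possible approach is sketched below.  To keep the sketch self-contained, it uses a shortened copy of the metadata and writes the file into a temporary folder first; both of those choices are assumptions made for illustration.

```python
import json
import os
import tempfile

# Shortened, illustrative copy of the metadata structure.
mdata = {
    'metadata': {
        'height': 1080,
        'width': 1920,
        'researchers': ['Itchy', 'Bart', 'Lisa'],
    }
}

# Hypothetical temporary location for the file.
path = os.path.join(tempfile.mkdtemp(), 'recording1.json')

# Write the file first, as in the example above.
f = open(path, 'w')
json.dump(mdata, f, indent=3)
f.close()

# Read it back into a new variable.
f = open(path, 'r')
data_from_json = json.load(f)
f.close()

# Comparing with the original tells us whether anything changed on the way.
print(data_from_json == mdata)  # True
```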

Write the metadata to a YAML file called `recording1.yml`:
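
A minimal sketch of the YAML version, assuming PyYAML is installed (see the `%pip install PyYAML` cell above).  As before, the shortened metadata and the temporary folder are assumptions made so that the sketch runs on its own.

```python
import os
import tempfile

import yaml

# Shortened, illustrative copy of the metadata structure.
mdata = {
    'metadata': {
        'height': 1080,
        'width': 1920,
        'order': 'RGB',
        'researchers': ['Itchy', 'Bart', 'Lisa'],
    }
}

# Hypothetical temporary location for the file.
path = os.path.join(tempfile.mkdtemp(), 'recording1.yml')

# Write the data structure as YAML text.
f = open(path, 'w')
yaml.dump(mdata, f)
f.close()

# Print the raw text to see what YAML looks like.
f = open(path, 'r')
print(f.read())
f.close()

# Read the YAML text back into a Python data structure.
f = open(path, 'r')
data_from_yaml = yaml.load(f, Loader=yaml.Loader)
f.close()
print(data_from_yaml == mdata)
```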

Translate the following sentence to a Python data structure and save it to either a JSON or YAML file called `capture.json` or `capture.yml`: *"The image has a width of 1080 pixels, a height of 720 pixels, saved data in RGB format. The camera settings had an exposure time of 8 milliseconds, an aperture of 2.8 stops, and an ISO setting of 100."*

Read the file back to a Python variable to check that it was saved correctly. (Note: this is sometimes called a "round-trip" test.)
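
One possible translation, saved as JSON and then round-tripped.  The key names (`image`, `camera`, `exposure_time_ms`, and so on) and the temporary folder are choices made for this sketch; the sentence does not dictate them, and other structures would be just as valid.

```python
import json
import os
import tempfile

# One way to structure the sentence; the key names are assumptions.
capture = {
    'image': {
        'width': 1080,
        'height': 720,
        'format': 'RGB',
    },
    'camera': {
        'exposure_time_ms': 8,
        'aperture': 2.8,
        'iso': 100,
    },
}

# Hypothetical temporary location for the file.
path = os.path.join(tempfile.mkdtemp(), 'capture.json')

# Save the data structure.
f = open(path, 'w')
json.dump(capture, f, indent=3)
f.close()

# Round-trip test: read it back and compare with the original.
f = open(path, 'r')
capture_from_file = json.load(f)
f.close()
print(capture_from_file == capture)  # True
```

Putting the unit into the key name (`exposure_time_ms`) is one simple way to keep units from getting lost when someone else reads the file.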

---

### The XML File Format

The XML file format is also used to store data, and is extremely popular; it's even used to store OdML data, which is the metadata format used by Nix!  Here, we'll get a sense of what XML looks like, so that when we see richer metadata files, we can more easily grok what Nix and OdML are doing.

Even though Python has an `xml` package included in its standard library, it can be quite complex to use.  Here, we're using the simpler `xmltodict` package to do basic reading and writing of XML.

| **Code** | **Description** |
| :--  | :-- |
| **`f = open('file.xml', 'wb')`** | Open a writable file in "binary" mode |
| **`f = open('file.xml', 'rb')`** | Open a readable file in "binary" mode |
| **`xmltodict.unparse(data, f)`** | Write the data to the binary file linked to `f` |
| **`data = xmltodict.parse(f)`** | Read the data in the binary file linked to `f`. |

In [79]:
# %pip install xmltodict

In [80]:
import xmltodict

**Exercises**

Here is our metadata structure again:

In [81]:
mdata = {
    'metadata': {
        'height': 1080, 
        'width': 1920, 
        'order': 'RGB', 
        'date': '2024-12-24', 
        'subject': {
            'id': 'x134', 
            'name': 'Scratchy', 
            'sources': ['Cartoon', 'The Simpsons Lab, Springfield']
        }, 
        'researchers': ['Itchy', 'Bart', 'Lisa']
    }
} 
mdata

{'metadata': {'height': 1080,
  'width': 1920,
  'order': 'RGB',
  'date': '2024-12-24',
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'researchers': ['Itchy', 'Bart', 'Lisa']}}

Write this `mdata` data structure to `data.xml`.

Take a look at the text file.  Can you see how the data has been encoded in XML?

Read the `data.xml` file back into Python.  Did it read correctly?  (Note: the `xmltodict.parse()` function requires a "bytes" file, so use `'rb'` as the file mode.)

XML also allows for more complex structures.  For example, the text below is valid XML:

In [82]:
text_xml = """
<root>
  <stimuli>
    <redleftc color="red" form="circle" side="left">Red Circle on the Left Side</redleftc>
    <redrights color="red" form="square" side="right">Red Square on the Right Side</redrights>
  </stimuli>
</root>
"""


Parse the `text_xml` string into a Python variable called `dset`, and find the side of the screen on which the "redrights" stimulus appears:

Note: The "@" prefix that `xmltodict` puts on attribute keys (for example, `'@side'`) is something special to the `xmltodict` library; it's not part of the XML syntax, and is just a way to make it easier to build a valid Python dict from the XML code.  This is part of the work that always happens when gluing two technologies together, and every library will have a different solution.
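
To see that the "@" really does belong to `xmltodict` and not to XML, it can help to read the same text with a different library.  The sketch below swaps in the standard-library `xml.etree.ElementTree` module (not `xmltodict`) purely for comparison: there, attributes live in a separate `attrib` dict with no "@" anywhere.

```python
import xml.etree.ElementTree as ET

text_xml = """
<root>
  <stimuli>
    <redleftc color="red" form="circle" side="left">Red Circle on the Left Side</redleftc>
    <redrights color="red" form="square" side="right">Red Square on the Right Side</redrights>
  </stimuli>
</root>
"""

# Parse the string; the result is the <root> element.
root = ET.fromstring(text_xml)

# Find the <redrights> element inside <stimuli>.
stim = root.find('stimuli/redrights')

# Attributes are stored in a plain dict, separate from the element's text.
print(stim.attrib)  # {'color': 'red', 'form': 'square', 'side': 'right'}

side = stim.get('side')
print(side)       # right
print(stim.text)  # Red Square on the Right Side
```

Both libraries read the same XML; they just make different choices about how to present attributes and text in Python.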