# Adding Sequences to the Data Table

As per `2023-03-02_logbook.md`, I want to do two things: add the data files contained in sequence directories to the main data table when it is constructed, and add a column to the data table denoting which sequence each data file came from, or "single run" if the data was obtained outside of a sequence.

It is tasks such as this which beg for an OOP approach to handling the data, but that is still a little way off as I'm not 100% on Python OOP.

First let's generate the data table.

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys

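# __file__ is not defined inside a notebook, so resolve the project root
# relative to the working directory (assumes the notebook runs from its own folder).
sys.path.insert(0, os.path.abspath(os.path.join(os.getcwd(), "..")))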

from pathlib import Path


import rainbow as rb


from scripts.core_scripts.data_interface import retrieve_uv_data

from scripts.core_scripts.data_interface import data_table

In [None]:
p = Path("/Users/jonathan/0_jono_data")

Now generate the table from that directory:

In [None]:
data_table(p).head()

In [None]:
datadir_p = Path(
    "/Users/jonathan/0_jono_data/2023-02-23_2021-DEBORTOLI-CABERNET-MERLOT_AVANTOR.D"
)

datadir = rb.read(str(datadir_p))

In [None]:
datadir.__dict__

There doesn't seem to be a way to get the sample name specifically, so we should instead pull the sample name from the `<Name>` field of `SAMPLE.XML`. At the same time we can pull the `<Description>` as well. Now, how to manipulate XML files in Python?

## Working with XML files in Python

### Getting Set Up

According to [this article](https://www.geeksforgeeks.org/reading-and-writing-xml-files-in-python/), BeautifulSoup and ElementTree can both be used to parse XML files. ~~I think ElementTree is what the rainbow-api devs used~~ *Actually, they use etree from lxml*. However, BeautifulSoup is ubiquitous in the greater world, so I will use that.

### Parsing an XML File

When parsing an XML file, you first find a *tag* then extract from it:

In [None]:
from bs4 import BeautifulSoup

In [None]:
with open(datadir_p / r"SAMPLE.XML", encoding="utf8", errors="ignore") as f:
    data = f.read()

    print(data)

We've established that we can read the data. Now to use it.

In [None]:
Bs_data = BeautifulSoup(data, "xml")
Bs_data.prettify()

In [None]:
b_unique = Bs_data.find_all("unique")

b_name = Bs_data.find("Sample")

print(b_name)

Problem: because of the way it's encoded, I don't think BeautifulSoup is parsing it correctly. According to [this stackoverflow post](https://stackoverflow.com/questions/17534932/how-to-verify-xml-encoding), I can check the encoding by reading the first eight bytes of the file. I can do that either with a hex editor or directly in Python by opening the file in binary mode (the `"rb"` mode argument to `open()`):

In [None]:
with open(datadir_p / r"SAMPLE.XML", "rb") as f:
    data = f.read(8)

    print(data)

Doesn't provide what I was expecting.
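
One possible reason the first bytes look unhelpful: a UTF-16 file usually starts with a two-byte byte-order mark (BOM) rather than readable text. Below is a minimal sketch of checking for one using the BOM constants in the standard `codecs` module. `sniff_bom` is a hypothetical helper name, the sample bytes are made up, and a file with no BOM simply gives `None`.

```python
import codecs

# BOMs to test, with UTF-32 first so it is not mistaken for UTF-16
# (the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding
    return None


# Made-up sample bytes: a UTF-16-LE BOM followed by "<?" in UTF-16-LE.
print(sniff_bom(b"\xff\xfe<\x00?\x00"))
```

In the notebook this could be fed the `data` bytes read from `SAMPLE.XML` in the cell above.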

On another tack, looking at the rainbow code [here](https://github.com/evanyeyeye/rainbow/blob/main/rainbow/agilent/chemstation.py), I can see that it does parse the XML files, but it appears to look for an `AcqData` directory, which is not present in my data files. Perhaps Agilent changed how the files are structured.

Pivoting back to the XML problem, inspecting the files in the terminal with bat showed that the `SAMPLE.XML` file is encoded with UTF-16LE. Maybe try that as a setting?

In [None]:
with open(datadir_p / r"SAMPLE.XML", "r", encoding="UTF-16LE") as f:
    data = f.read()

    print(data)

That seems happier?

In [None]:
Bs_data = BeautifulSoup(data, "xml")
print(Bs_data.prettify())

Yes, that's working.

In [None]:
b_sample = Bs_data.find("Sample")

print(b_sample)

And to access the name?

In [None]:
name = b_sample.find("Name")
name

Ok, but how do I get that out as a string?

In [None]:
type(name)

Try `str()`:

In [None]:
str(name)

That returns the whole tag, markup included, rather than just the text inside it. What about [soup's documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)? It mentions a `get_text()` method:

In [None]:
name_str = name.get_text()
name_str

In [None]:
type(name_str)

And there we go.
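
For comparison, the same lookup can be sketched with the standard library's `xml.etree.ElementTree`, which needs no extra install. This is an alternative to the BeautifulSoup approach used above, not what the notebook goes on to use. The XML below is a made-up miniature: only the `Sample`, `Name` and `Description` tag names come from the real `SAMPLE.XML`, the nesting is an assumption, and `read_sample_fields` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def read_sample_fields(raw: bytes):
    """Decode UTF-16 bytes and return the (name, description) text."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    text = raw.decode("utf-16")
    root = ET.fromstring(text)
    # iter() walks the whole tree (root included), so this works whether
    # <Sample> is the root element or nested further down.
    sample = next(root.iter("Sample"))
    # findtext() returns the text of a direct child, like get_text() on a tag.
    name = sample.findtext("Name", default="")
    description = sample.findtext("Description", default="")
    return name, description


# Made-up stand-in for SAMPLE.XML, encoded with a BOM by the "utf-16" codec.
fake_xml = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<Root><Sample><Name>example wine</Name>"
    "<Description>a test sample</Description></Sample></Root>"
).encode("utf-16")

print(read_sample_fields(fake_xml))
```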

## OOP Data

Fuck it, let's do it.

In [None]:
data

In [None]:
from datetime import datetime

# A prototype class definition using bits of rainbow and bits of direct XML parsing.
# I'll keep building it as my use cases grow, but it is barebones at the moment.


class Data:
    def __init__(self, file_path):
        self.file_path = file_path
        # Parse SAMPLE.XML once and unpack both fields.
        self.name, self.description = self.load_meta_data()
        self.rainbow = self.rb_object()
        # Reuse the rainbow object loaded above rather than reading the dir twice.
        self.uv_data = retrieve_uv_data(self.rainbow)
        self.method = self.rainbow.datafiles[0].metadata["method"]
        self.acq_date = datetime.strptime(
            self.rainbow.metadata["date"], "%d-%b-%y, %H:%M:%S"
        )
        # The method has its own name so the attribute does not shadow it.
        self.sequence_name = self._find_sequence_name()

    def _find_sequence_name(self):
        if ".sequence" in self.file_path.parent.name:
            return self.file_path.parent.name
        else:
            return "single run"

    def load_meta_data(self):
        """
        Load the name and description from the SAMPLE.XML found in .D dirs,
        and clean the description string.

        At the moment the whole XML file is read to get these two tags, which
        seems inefficient, but I don't know how to do it otherwise.
        """

        with open(self.file_path / r"SAMPLE.XML", "r", encoding="UTF-16LE") as f:
            xml_data = f.read()

            bsoup_xml = BeautifulSoup(xml_data, "xml")

            name = bsoup_xml.find("Name").get_text()

            description = bsoup_xml.find("Description").get_text()
            # Strip first: once spaces become hyphens, strip() has nothing left to trim.
            clean_description = description.strip().replace("\n", "").replace(" ", "-")

        return name, clean_description

    def rb_object(self):
        """
        loads the whole target data dir, currently it just returns the method and the data.
        """
        rainbow_obj = rb.read(str(self.file_path))

        return rainbow_obj


a_data_file = Data(datadir_p)

print(a_data_file.name)

In [None]:
a_data_file.sequence_name

In [None]:
seq_data_file = Data(
    Path(
        "/Users/jonathan/0_jono_data/2023-02-16_WINES_2023-02-16_13-46-32.sequence/001-0101.D"
    )
)
seq_data_file.sequence_name

In [None]:
a_data_file.method

In [None]:
a_data_file.uv_data[350].plot()

In [None]:
str(a_data_file.acq_date)
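
The format string passed to `strptime` in the `Data` class can be checked in isolation. A small sketch follows; the timestamp is a made-up value shaped the way the format string expects, not one read from a real file, and `%b` assumes English month abbreviations (the default C locale).

```python
from datetime import datetime

# Format used in the Data class: day-abbreviated month-two digit year, then the time.
ACQ_DATE_FORMAT = "%d-%b-%y, %H:%M:%S"

# Made-up example timestamp (hypothetical, for illustration only).
example = "23-Feb-23, 13:46:32"

acq_date = datetime.strptime(example, ACQ_DATE_FORMAT)
print(acq_date.isoformat())
```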

Now, a top level data dir class could be useful..?

In [None]:
# It will be able to return individual data files, all the single runs, and all the sequences.


class Top_Dir:
    def __init__(self, dir_path):
        self.path = dir_path
        # Helper has its own name so the attributes do not shadow a method.
        self.single_runs = self._find_by_suffix(".D")
        self.sequences = self._find_by_suffix(".sequence")

    def _find_by_suffix(self, suffix):
        """Return the names of top-level entries whose name ends with suffix."""
        # The original single_runs() built its list but never returned it.
        return [obj.name for obj in self.path.iterdir() if obj.name.endswith(suffix)]

In [None]:
top_dir = Top_Dir(p)

top_dir.sequences

I don't know yet how to handle the sequence data files. Whatever container I use for the data should simply 'unwrap' the sequence directories and store the single run and sequence .D at the same level, keeping track of the origin of the file - "sequence" or "single run".
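
One way this "unwrapping" could look, as a rough sketch rather than the final design: walk the top-level directory once and, for every `.D` directory, record where it came from. `collect_runs` and the `(path, origin)` tuple layout are hypothetical names chosen for illustration; the `.D` and `.sequence` suffix checks and the "single run" label follow the conventions already used above.

```python
from pathlib import Path


def collect_runs(top_dir: Path):
    """Return a flat list of (run_path, origin) for every .D directory.

    origin is the name of the enclosing .sequence directory, or "single run"
    for .D directories sitting directly in top_dir.
    """
    runs = []
    for entry in sorted(top_dir.iterdir()):
        if not entry.is_dir():
            continue
        if entry.suffix == ".D":
            runs.append((entry, "single run"))
        elif entry.suffix == ".sequence":
            # Unwrap the sequence: its .D children join the same flat list.
            for child in sorted(entry.iterdir()):
                if child.is_dir() and child.suffix == ".D":
                    runs.append((child, entry.name))
    return runs
```

Calling `collect_runs(p)` on the data directory would give one list that could later feed the sequence column of the data table.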

The answer is to build a class hierarchy corresponding to the file structure. Let's call the application Agilette, a petite Agilent Chemstation imitator.

Agilette
 |
 |__top_dir
 |     |
 |     |__*sequences*
 |     |
 |     |__*single_runs*
 |
 |__data_table

In [None]:
class Agilette:
    def __init__(self, path: str):
        self.path = path

In [None]:
ag = Agilette("/Users/jonathan/0_jono_data/")
ag.path

I've reached a point where I should be refactoring all of this into .py files. I'll do that now.