# File Input and Output (I/O)

In this section, we'll discuss several of the common methods for getting the types of dataset used by astronomers into Python for analysis, and how to save our own generated (or modified) data back out. This is a core skill, because it is usually the first step in any research endeavour that is not a pure theory-based analytic calculation. 

As an example: let's say you've found a research mentor --- graduate student, postdoc, or faculty --- and they have agreed to help you get started with research. Generally, this will involve two things: 
- They will provide a (possibly overwhelming) list of papers and review articles for you to read, and 
- They will provide some data related to the project. 

Then they might say something like "As you skim through these papers to get a feel for what you will be doing, try playing around with these data a bit and see if you can make a plot of X vs Y." Or, they might say "Go online and download the NASA-Sloan Atlas and see if you can make a BPT diagram for all the galaxies with non-zero line flux."


This is where file I/O comes in. Even if you have taken a general introductory Python course, online or in a university computer science department, that course likely never had cause to teach you how to read in a `FITS` file, a specialized file type that can only be read with dedicated tools. Even if your mentor provides something as simple as a text file with some columns of numbers written out (known as `ASCII`), a function you may have learned, such as `numpy.loadtxt()` or `pandas.read_csv()`, may not work on that data right "out of the box," because astronomical data in ASCII files often

- have missing entries,
- have a mix of delimiters, or
- are not even formatted into clean columns.

If that sounds frustrating, it is, and it's why more constrained data formats (like `FITS`, `asdf`, or `hdf5`) are now preferred. Generally, if you can successfully *write* one of these files, adhering to their internal rules, you can *read* them back in simply and without hassle.

In the following examples, I am going to demonstrate the *general* method for reading several common data formats in astronomy. This can be treated as a "cheat-sheet" in the data formats if you already know a bit of Python. If you are brand new to Python, I suggest skipping ahead to basic Python syntax and structure first, so that you can separate what is a Python-ism and what is data format related. 

## ASCII (text files)

The simplest (in some sense) format for data storage is the humble text file. The primary advantage of an ASCII file as a data storage format is portability; no special software is needed to read these files (in fact, any computer can do so via one of several programs, including the shell). On many computers, you can preview the contents of a text file without even opening it. 

Text files are also very flexible, insofar as they impose no rules on how you format the information placed into them. As mentioned above, however, this can lead to headaches later, because decisions that make a text file friendly to the eye might make it impossible to parse (easily) by computer.

ASCII files are so basic that even Python itself has an I/O operation that can read them in. Here is an example of that for one of the data sets you will use in the Chapter Assignment on astronomical imaging (*Note, this is almost never the recommended method*):

In [1]:
with open('../Imaging/relano2016_m33_apertures.txt','r') as f:
    data = f.read()

Above, I use a Python structure called a *context manager* (the "with open" formatting) simply because it allows a file to be opened, read, and then automatically closed again once the requisite data has been loaded. The equivalent without a context manager would be:

In [2]:
f = open('../Imaging/relano2016_m33_apertures.txt','r') 
data = f.read() 
f.close()

Let's look at what our `data` variable looks like:

In [3]:
data

'ID,  RA,            DEC,           ap_radius, type\n1\t,01:32:32.330,+30:35:03.97,48.5,Mixed\n2\t,01:32:34.685,+30:30:27.45,40.7,Mixed\n3\t,01:32:34.687,+30:27:29.01,44.8,Mixed\n4\t,01:32:37.566,+30:40:08.76,53.2,ClearShell\n5\t,01:32:44.823,+30:34:58.75,29.5,Filled\n6\t,01:32:44.903,+30:25:10.88,60.0,ClearShell\n7\t,01:32:45.500,+30:38:55.18,47.6,Mixed\n8\t,01:32:46.135,+30:20:25.64,30.3,ClearShell\n9\t,01:32:51.918,+30:29:54.64,45.4,ClearShell\n10\t,01:32:52.458,+30:34:57.00,32.5,Shell\n11\t,01:32:52.578,+30:38:17.46,19.5,ClearShell\n12\t,01:32:53.934,+30:23:18.05,36.1,Mixed\n13\t,01:32:56.280,+30:40:34.37,39.4,Shell\n14\t,01:32:56.655,+30:27:25.20,37.4,Shell\n15\t,01:32:58.381,+30:35:58.24,28.5,Shell\n16\t,01:32:59.261,+30:44:20.36,38.3,Mixed\n17\t,01:33:00.968,+30:30:53.08,39.4,Shell\n18\t,01:33:03.079,+30:11:22.52,40.1,Mixed\n19\t,01:33:07.361,+30:42:40.20,34.7,Mixed\n20\t,01:33:08.571,+30:29:54.02,34.3,Mixed\n21\t,01:33:10.283,+30:27:24.12,41.1,ClearShell\n22\t,01:33:11.174,+30:

In [4]:
type(data)

str

As we can see, the file has been read in as a single, long string. That's not very helpful. Try opening the Relano file using a program on your computer (like Notepad on a PC, or TextEdit on a Mac). Notice that our string here contains things that don't seem to be in the file, like "\n" or "\t". These are *escape sequences*: the way Python displays the invisible formatting characters that give an ASCII file its structure. For example, "\t" indicates a tab between two entries, and "\n" indicates a new line in the file.

While we could use those formatting tags to construct a set of code that parses this string into actual data we care about, it is instead better to use one of several libraries that have functions directly related to this purpose. 
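To get a feel for why the manual route gets tedious, here is a minimal sketch of it. The `sample` string is a hypothetical stand-in built from the first few lines of the output shown above (not the full file), and the variable names are my own.

```python
# Hypothetical stand-in for the string returned by f.read():
# the header row plus the first three data rows shown above.
sample = ('ID,  RA,            DEC,           ap_radius, type\n'
          '1\t,01:32:32.330,+30:35:03.97,48.5,Mixed\n'
          '2\t,01:32:34.685,+30:30:27.45,40.7,Mixed\n'
          '3\t,01:32:34.687,+30:27:29.01,44.8,Mixed\n')

# "\n" separates rows, so split the string into one entry per line.
lines = sample.strip().split('\n')

# "," separates columns; strip() removes stray spaces and tabs.
column_names = [name.strip() for name in lines[0].split(',')]

rows = []
for line in lines[1:]:
    entries = [entry.strip() for entry in line.split(',')]
    rows.append(entries)

# Everything is still a string, so each column must be converted by hand.
ids = [int(row[0]) for row in rows]
radii = [float(row[3]) for row in rows]
```

Even for this tidy file, we had to handle the header, the whitespace, and the type conversions ourselves, which is exactly the bookkeeping the library readers do for us.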

The `numpy` library has two functions useful for this: `np.loadtxt` and `np.genfromtxt`. If you have a very simple text file with columns of numbers and no mix of data types, `np.loadtxt` is probably fine. But in many cases, `np.genfromtxt` is more flexible. Let's try it out:

In [5]:
import numpy as np 
data = np.genfromtxt('../Imaging/relano2016_m33_apertures.txt')

ValueError: Some errors were detected !
    Line #2 (got 2 columns instead of 5)
    Line #3 (got 2 columns instead of 5)
    Line #4 (got 2 columns instead of 5)
    Line #5 (got 2 columns instead of 5)
    Line #6 (got 2 columns instead of 5)
    Line #7 (got 2 columns instead of 5)
    Line #8 (got 2 columns instead of 5)
    Line #9 (got 2 columns instead of 5)
    Line #10 (got 2 columns instead of 5)
    Line #11 (got 2 columns instead of 5)
    Line #12 (got 2 columns instead of 5)
    Line #13 (got 2 columns instead of 5)
    Line #14 (got 2 columns instead of 5)
    Line #15 (got 2 columns instead of 5)
    Line #16 (got 2 columns instead of 5)
    Line #17 (got 2 columns instead of 5)
    Line #18 (got 2 columns instead of 5)
    Line #19 (got 2 columns instead of 5)
    Line #20 (got 2 columns instead of 5)
    Line #21 (got 2 columns instead of 5)
    Line #22 (got 2 columns instead of 5)
    Line #23 (got 2 columns instead of 5)
    Line #24 (got 2 columns instead of 5)
    Line #25 (got 2 columns instead of 5)
    Line #26 (got 2 columns instead of 5)
    Line #27 (got 2 columns instead of 5)
    Line #28 (got 2 columns instead of 5)
    Line #29 (got 2 columns instead of 5)
    Line #30 (got 2 columns instead of 5)
    Line #31 (got 2 columns instead of 5)
    Line #32 (got 2 columns instead of 5)
    Line #33 (got 2 columns instead of 5)
    Line #34 (got 2 columns instead of 5)
    Line #35 (got 2 columns instead of 5)
    Line #36 (got 2 columns instead of 5)
    Line #37 (got 2 columns instead of 5)
    Line #38 (got 2 columns instead of 5)
    Line #39 (got 2 columns instead of 5)
    Line #40 (got 2 columns instead of 5)
    Line #41 (got 2 columns instead of 5)
    Line #42 (got 2 columns instead of 5)
    Line #43 (got 2 columns instead of 5)
    Line #44 (got 2 columns instead of 5)
    Line #45 (got 2 columns instead of 5)
    Line #46 (got 2 columns instead of 5)
    Line #47 (got 2 columns instead of 5)
    Line #48 (got 2 columns instead of 5)
    Line #49 (got 2 columns instead of 5)
    Line #50 (got 2 columns instead of 5)
    Line #51 (got 2 columns instead of 5)
    Line #52 (got 2 columns instead of 5)
    Line #53 (got 2 columns instead of 5)
    Line #54 (got 2 columns instead of 5)
    Line #55 (got 2 columns instead of 5)
    Line #56 (got 2 columns instead of 5)
    Line #57 (got 2 columns instead of 5)
    Line #58 (got 2 columns instead of 5)
    Line #59 (got 2 columns instead of 5)
    Line #60 (got 2 columns instead of 5)
    Line #61 (got 2 columns instead of 5)
    Line #62 (got 2 columns instead of 5)
    Line #63 (got 2 columns instead of 5)
    Line #64 (got 2 columns instead of 5)
    Line #65 (got 2 columns instead of 5)
    Line #66 (got 2 columns instead of 5)
    Line #67 (got 2 columns instead of 5)
    Line #68 (got 2 columns instead of 5)
    Line #69 (got 2 columns instead of 5)
    Line #70 (got 2 columns instead of 5)
    Line #71 (got 2 columns instead of 5)
    Line #72 (got 2 columns instead of 5)
    Line #73 (got 2 columns instead of 5)
    Line #74 (got 2 columns instead of 5)
    Line #75 (got 2 columns instead of 5)
    Line #76 (got 2 columns instead of 5)
    Line #77 (got 2 columns instead of 5)
    Line #78 (got 2 columns instead of 5)
    Line #79 (got 2 columns instead of 5)
    Line #80 (got 2 columns instead of 5)
    Line #81 (got 2 columns instead of 5)
    Line #82 (got 2 columns instead of 5)
    Line #83 (got 2 columns instead of 5)
    Line #84 (got 2 columns instead of 5)
    Line #85 (got 2 columns instead of 5)
    Line #86 (got 2 columns instead of 5)
    Line #87 (got 2 columns instead of 5)
    Line #88 (got 2 columns instead of 5)
    Line #89 (got 2 columns instead of 5)
    Line #90 (got 2 columns instead of 5)
    Line #91 (got 2 columns instead of 5)
    Line #92 (got 2 columns instead of 5)
    Line #93 (got 2 columns instead of 5)
    Line #94 (got 2 columns instead of 5)
    Line #95 (got 2 columns instead of 5)
    Line #96 (got 2 columns instead of 5)
    Line #97 (got 2 columns instead of 5)
    Line #98 (got 2 columns instead of 5)
    Line #99 (got 2 columns instead of 5)
    Line #100 (got 2 columns instead of 5)
    Line #101 (got 2 columns instead of 5)
    Line #102 (got 2 columns instead of 5)
    Line #103 (got 2 columns instead of 5)
    Line #104 (got 2 columns instead of 5)
    Line #105 (got 2 columns instead of 5)
    Line #106 (got 2 columns instead of 5)
    Line #107 (got 2 columns instead of 5)
    Line #108 (got 2 columns instead of 5)
    Line #109 (got 2 columns instead of 5)
    Line #110 (got 2 columns instead of 5)
    Line #111 (got 2 columns instead of 5)
    Line #112 (got 2 columns instead of 5)
    Line #113 (got 2 columns instead of 5)
    Line #114 (got 2 columns instead of 5)
    Line #115 (got 2 columns instead of 5)
    Line #116 (got 2 columns instead of 5)
    Line #117 (got 2 columns instead of 5)
    Line #118 (got 2 columns instead of 5)
    Line #119 (got 2 columns instead of 5)
    Line #120 (got 2 columns instead of 5)

Ouch! We got a bunch of errors! By default, `np.genfromtxt` splits each line on whitespace and does very little guessing about the format of the file, so it found 5 "columns" in the header row but only 2 in each data row. We have to provide several extra arguments to the function to tell it how our data looks:

In [10]:
data = np.genfromtxt('../Imaging/relano2016_m33_apertures.txt',
                     skip_header=1,
                     delimiter=',')

If you look directly at the text file we're working with, you'll notice that the first row contains names for each column. We want to skip this row when reading in the data because it will (usually) be a different data type (str) than the data in the column below it. Additionally, we had to specify that the delimiter between entries in a row is a comma. But now we should have a reasonable array of data:

In [11]:
data[:5]

array([[ 1. ,  nan,  nan, 48.5,  nan],
       [ 2. ,  nan,  nan, 40.7,  nan],
       [ 3. ,  nan,  nan, 44.8,  nan],
       [ 4. ,  nan,  nan, 53.2,  nan],
       [ 5. ,  nan,  nan, 29.5,  nan]])

Oh dear... most of the data has been read in as `NaN` ("Not a Number"). The reason is that we have a mix of data types in our file: the RA, DEC, and "type" columns are all technically strings, and only the ID and `ap_radius` columns are numbers. By default, numpy tries to convert every entry to a decimal-valued number (known as a float), and anything it cannot convert becomes `NaN`.

As it turns out, we can also specify the data types for each individual column upon loading the file:

In [19]:
data = np.genfromtxt('../Imaging/relano2016_m33_apertures.txt',
                     skip_header=1,
                     delimiter=',',
                     dtype=(int,'<U25','<U25',float,'<U25'))
data[:5]

array([(1, '01:32:32.330', '+30:35:03.97', 48.5, 'Mixed'),
       (2, '01:32:34.685', '+30:30:27.45', 40.7, 'Mixed'),
       (3, '01:32:34.687', '+30:27:29.01', 44.8, 'Mixed'),
       (4, '01:32:37.566', '+30:40:08.76', 53.2, 'ClearShell'),
       (5, '01:32:44.823', '+30:34:58.75', 29.5, 'Filled')],
      dtype=[('f0', '<i8'), ('f1', '<U25'), ('f2', '<U25'), ('f3', '<f8'), ('f4', '<U25')])

Hooray! We finally got the data into Python (in at least what appears to be a sensible way). Above, I specified that columns 1 and 4 were `int` and `float`, while the others were given `'<U25'`, which is a way of saying "a string of at most 25 characters."

Are we finished? Unfortunately, no. The `numpy` library is not well designed for this type of heterogeneous data, and is currently storing it as a structured array. Because we did not name the columns, they are stuck with placeholder names like `f0` and `f3`, which makes it clumsy to pull out columns or perform operations (like choosing just the rows with radii greater than some value).
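For completeness, a structured array becomes somewhat more usable if we name its fields ourselves. The sketch below is only an illustration: it reads a small hypothetical sample (three rows in the same layout as the file, supplied through `io.StringIO`) rather than the real file, and the field names in `named_dtype` are my own choice.

```python
import io
import numpy as np

# Hypothetical three-row sample in the same layout as the real file.
sample = io.StringIO(
    'ID,  RA,            DEC,           ap_radius, type\n'
    '1,01:32:32.330,+30:35:03.97,48.5,Mixed\n'
    '2,01:32:34.685,+30:30:27.45,40.7,Mixed\n'
    '4,01:32:37.566,+30:40:08.76,53.2,ClearShell\n'
)

# Pair each column with a name and a type instead of relying on f0, f1, ...
named_dtype = [('ID', int), ('RA', 'U25'), ('DEC', 'U25'),
               ('ap_radius', float), ('type', 'U25')]

data = np.genfromtxt(sample, skip_header=1, delimiter=',',
                     dtype=named_dtype, autostrip=True, encoding='utf-8')

# Columns are now reachable by name, and boolean masks select rows.
radii = data['ap_radius']
large = data[data['ap_radius'] > 45]
```

This works, but we had to spell out every column name and type by hand, which quickly becomes painful for wide tables.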

We could loop through this array and ultimately place each column into a different list, then start indexing. But it turns out there was (all along) a best tool for the job: `pandas`. 

The pandas library will be covered more later, but in short, it is a data-science oriented library with tools well-suited to the (often) heterogeneous datasets seen in industry. It also has some logic built in to try to figure out the structure of a dataset. 

In [29]:
import pandas as pd 
data = pd.read_csv('../Imaging/relano2016_m33_apertures.txt')
data[:5]

|   | ID | RA | DEC | ap_radius | type |
|---|----|----|-----|-----------|------|
| 0 | 1 | 01:32:32.330 | +30:35:03.97 | 48.5 | Mixed |
| 1 | 2 | 01:32:34.685 | +30:30:27.45 | 40.7 | Mixed |
| 2 | 3 | 01:32:34.687 | +30:27:29.01 | 44.8 | Mixed |
| 3 | 4 | 01:32:37.566 | +30:40:08.76 | 53.2 | ClearShell |
| 4 | 5 | 01:32:44.823 | +30:34:58.75 | 29.5 | Filled |


As we can see, with almost no effort at all, `pandas` read our data into a table, deftly figuring out the column headings and parsing the heterogeneous data types. We can now get the radii out easily, or indeed, find all rows with a radius bigger than some value. There is only one thing that pandas did *incorrectly* which we do have to fix. The delimiter in our file is a comma, but there are many spaces in the header row, designed, it seems, to make it easier to see which column heading goes with each column. `pandas` didn't know to strip those spaces, so if we actually print the column names, we'll see the following:

In [31]:
data.columns

Index(['ID', '  RA', '            DEC', '           ap_radius', ' type'], dtype='object')

We need to remove these spaces or we won't be able to easily index out the columns. 

In [32]:
data = pd.read_csv('../Imaging/relano2016_m33_apertures.txt',
                   skipinitialspace=True)
data.columns

Index(['ID', 'RA', 'DEC', 'ap_radius', 'type'], dtype='object')

The `pandas.read_csv()` function has *many* optional arguments that let us adapt it to different data. Here, I've used the `skipinitialspace` argument to tell it to strip out those leading spaces. Now, with this (still relatively easy) line of code, we can perform our data analysis:

In [36]:
data.loc[data.ap_radius>55]

|    | ID | RA | DEC | ap_radius | type |
|----|----|----|-----|-----------|------|
| 5 | 6 | 01:32:44.903 | +30:25:10.88 | 60.0 | ClearShell |
| 27 | 28 | 01:33:15.673 | +30:56:40.94 | 60.0 | Mixed |
| 28 | 29 | 01:33:15.870 | +30:53:24.88 | 67.5 | Mixed |
| 32 | 33 | 01:33:23.026 | +30:50:23.31 | 60.0 | Mixed |
| 39 | 40 | 01:33:28.248 | +30:52:49.29 | 60.0 | Shell |
| 47 | 48 | 01:33:35.110 | +31:00:54.44 | 60.0 | ClearShell |
| 55 | 56 | 01:33:44.499 | +31:02:04.51 | 60.0 | Shell |
| 82 | 83 | 01:34:10.505 | +30:21:52.11 | 59.2 | ClearShell |
| 86 | 87 | 01:34:13.301 | +31:09:14.58 | 60.0 | ClearShell |
| 97 | 98 | 01:34:33.060 | +30:47:01.71 | 74.2 | Mixed |


Above, for example, I've *filtered* the DataFrame for only rows where the aperture radius is greater than 55 arcseconds. 
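Filters like this can also be combined. The following is a small sketch, using a hypothetical five-row sample in the same layout as the file (passed in through `io.StringIO` so it runs without the real data), that selects rows satisfying two conditions at once.

```python
import io
import pandas as pd

# Hypothetical five-row sample in the same layout as the real file.
sample = io.StringIO(
    'ID,  RA,            DEC,           ap_radius, type\n'
    '1,01:32:32.330,+30:35:03.97,48.5,Mixed\n'
    '2,01:32:34.685,+30:30:27.45,40.7,Mixed\n'
    '3,01:32:34.687,+30:27:29.01,44.8,Mixed\n'
    '4,01:32:37.566,+30:40:08.76,53.2,ClearShell\n'
    '5,01:32:44.823,+30:34:58.75,29.5,Filled\n'
)
data = pd.read_csv(sample, skipinitialspace=True)

# One condition: aperture radius above 45 arcseconds.
big = data.loc[data['ap_radius'] > 45]

# Two conditions: wrap each in parentheses and join them with & ("and").
big_mixed = data.loc[(data['ap_radius'] > 45) & (data['type'] == 'Mixed')]
```

I index with `data['type']` rather than `data.type` here as a habit: the bracket form always refers to the column, even when a column name happens to collide with an existing DataFrame attribute or method.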


### What have we learned?

When it comes to ASCII text files, this is only the beginning. These read-text functions have numerous extra optional parameters because many text files are in fact much more complicated and headache-inducing than this one. What I want you to take away is:

- Sometimes, one tool/reader will be better suited to your data than another and can help reduce the time it takes to get data imported,
- Indeed, sometimes using one tool to *load* the data, then converting it to another data container of choice (say, from `pandas` to `numpy`) can be a good trick,
- Sometimes, but not always, simply modifying the underlying text file in a notepad-like program first can make it easier to read (for example, adding a comma delimiter).
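As a sketch of the "load with one tool, convert to another" trick from the list above, here is one way to let `pandas` do the parsing and then hand a column over to `numpy`. The three-row sample is hypothetical and is read from an in-memory string so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as the real file.
sample = io.StringIO(
    'ID,  RA,            DEC,           ap_radius, type\n'
    '1,01:32:32.330,+30:35:03.97,48.5,Mixed\n'
    '2,01:32:34.685,+30:30:27.45,40.7,Mixed\n'
    '3,01:32:34.687,+30:27:29.01,44.8,Mixed\n'
)

# Let pandas handle the messy parsing...
df = pd.read_csv(sample, skipinitialspace=True)

# ...then pull out a plain numpy array for numerical work.
radii = df['ap_radius'].to_numpy()
mean_radius = radii.mean()
```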

## Want practice with Pandas?

Try out this lab: https://astro-330.github.io/Lab6/Lab6.html


## FITS (Flexible Image Transport System) 

Thankfully, from here on out, things are going to get simpler. We are moving to data formats where, in order to *write* a file in the first place, it must follow rules that make it *readable*. Astronomical images and data tables are very often stored in what are known as `FITS` files. It's a stable standard that has been around for a long time. Reading `FITS` files is relatively simple, as the `astropy` library has a module dedicated to the purpose:

In [None]:
from astropy.io import fits
with fits.open('../../BookDatasets/imaging/m33-halpha.fits') as hdu: 
    image = hdu[0].data 
    header = hdu[0].header

I've used the same context manager format as we did above, but instead of Python's `open()`, I've used the `fits.open()` function in astropy. Let's talk briefly about how `FITS` files are organized. 

Internally, a `FITS` file is organized into *extensions*. You can think of these as different containers in which different things can be stored. Many `FITS` files only use one extension (say, to store a single image). But a `FITS` file containing a whole survey catalog may use multiple extensions to store different tables, such as one for each field of the survey.

When we read a `FITS` file into an object, it is iterable (or list-like), meaning we access the different extensions by indexing (above I only use the 0th index for the first extension). Each extension (container) has two attributes: data and a header. The data is, as it sounds, whatever data we have stored. Generally, it is either image-like (as in, an array of numbers) or tabular (there is a specific internal formatting to handle column names, data types, etc.). The header, meanwhile, can be thought of as the metadata: it usually contains information about whatever is stored in the data attribute (sometimes critical information for parsing that data). For example, image data usually have headers with information like the exposure time, telescope name, RA and DEC where the telescope was pointing, what filter was used, etc. For our Python purposes, a header gets read into a dictionary-like object, with keys and values.
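To make the extension/data/header picture concrete, here is a small sketch that builds a two-extension `FITS` object entirely in memory and pokes at it. The array contents, the `EXPTIME` value, and the extension name `EXTRA` are all made up for illustration.

```python
import numpy as np
from astropy.io import fits

# Build a small two-extension FITS object in memory (no file needed).
primary = fits.PrimaryHDU(np.zeros((4, 4)))
primary.header['EXPTIME'] = (30.0, 'hypothetical exposure time in seconds')
extra = fits.ImageHDU(np.ones((2, 3)), name='EXTRA')
hdul = fits.HDUList([primary, extra])

# Print a one-line summary of every extension.
hdul.info()

# Header values are looked up by key, much like a dictionary.
exptime = hdul[0].header['EXPTIME']

# Each extension's data is an ordinary numpy array.
shape_ext1 = hdul[1].data.shape

# Extensions can be indexed by name as well as by number.
by_name = hdul['EXTRA'].data
```

Calling `.info()` on a file you have just opened is a quick way to find out how many extensions it has and what each one contains before you start indexing.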

You might be wondering why I chose the name `hdu` when reading in the file. As each extension contains a header and some data, the standard name for one is "header data unit," or HDU. (Strictly speaking, `fits.open()` returns a list of HDUs, an `HDUList`, which is why you will often see it named `hdul` in other code.) Above, we could've used any name. But knowing what HDU means is useful because if we were to *write* a `FITS` file, we'd need to do something like the following:

In [None]:
save_array = np.zeros((10,10))

hdu = fits.PrimaryHDU(save_array,header={'IMSHAPE':'10 x 10'})
hdu.writeto('some_name.fits')

As we can see, the way to make an HDU to save a new `FITS` file is to call `fits.PrimaryHDU()`, to which we passed the data array and a dictionary, which will be put in the header. Note that `writeto()` will refuse to replace an existing file unless you pass `overwrite=True`. And if we need multiple extensions, we would create a `fits.HDUList()`, containing a `fits.PrimaryHDU()` and possibly some additional `fits.ImageHDU()` objects.
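Here is a minimal sketch of that multi-extension case, written to a temporary directory and then read back to confirm the round trip. The file name, the `OBSERVER` value, and the extension name `DOUBLED` are hypothetical.

```python
import os
import tempfile

import numpy as np
from astropy.io import fits

image = np.arange(100.0).reshape(10, 10)

# Build the header explicitly, then attach it to the primary HDU.
header = fits.Header()
header['OBSERVER'] = 'A. Student'   # hypothetical metadata
primary = fits.PrimaryHDU(image, header=header)

# A second image goes into its own named extension.
second = fits.ImageHDU(image * 2, name='DOUBLED')
hdul = fits.HDUList([primary, second])

# Write to a temporary location; overwrite=True allows re-running the cell.
path = os.path.join(tempfile.mkdtemp(), 'two_extensions.fits')
hdul.writeto(path, overwrite=True)

# Read it back and check that both extensions survived.
with fits.open(path) as hdul_in:
    n_ext = len(hdul_in)
    recovered = hdul_in['DOUBLED'].data.copy()
    observer = hdul_in[0].header['OBSERVER']
```

I copy the array inside the `with` block so that it remains usable after the file has been closed.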

For non-tabular `FITS` files, the `image` variable above will just be a `numpy` array, and the `header` will behave just like a dictionary. For a much deeper dive into what to do from here if working with an image, see the workbook on Astronomical Imaging.

How about if tabular data were stored?

By convention, tabular data is stored starting in the 1st, rather than 0th, extension. Tables are thus stored after the image, or if there is no image, the 0th extension is left empty. Tabular data parsed directly using the format above will be a special `FITS` record array. As an example, let's load the NASA-Sloan Atlas, which you'll be using later in the text.

In [41]:
with fits.open('../../BookDatasets/catalogs/nsa_v0_1_2.fits') as hdu:
    table = hdu[1].data

Unfortunately, there's no easy way to display this table in a view-friendly format. Here's a truncated peek:

In [43]:
table[0]

('J094651.40-010228.5', '09h/m00/J094651.40-010228.5', 146.71420878660933, -1.0412815695754145, 0, 72212, 21157, -1, -1, -1, 15.178774, 0.021222278, 'sdss', 0.07, 756, 1, 206, '301', 136.29353, 1095.152, 0.020597626, 0.020687785, 0.00044536332, 0, array([  29.552263,   53.198177,  175.86322 ,  819.325   , 1793.9138  ,
       2480.876   , 3251.03    ], dtype=float32), array([3.0069932e-01, 1.9650683e-01, 1.4631173e-02, 4.3475279e-03,
       9.1012771e-04, 4.7585298e-04, 1.2297294e-04], dtype=float32), 1, array([  31.202734,   49.386097,  199.0066  ,  824.0366  , 1712.252   ,
       2462.11    , 3454.6162  ], dtype=float32), array([-15.1673565, -15.816648 , -17.190884 , -18.824028 , -19.66704  ,
       -20.00378  , -20.296247 ], dtype=float32), array([ 222.77438,  471.76144,  383.8668 , 2475.7463 , 2484.602  ,
       2484.4727 , 1102.5615 ], dtype=float32), array([0.4536473 , 0.44762787, 0.2820931 , 0.20756142, 0.15054086,
       0.11415057, 0.08093417], dtype=float32), array([-0.0060749

Part of the issue here is that this table has many columns and rows. A better way to load a table that has been stored in a `FITS` file is via the `astropy` `Table` object:

In [44]:
from astropy.table import Table 
table = Table.read('../../BookDatasets/catalogs/nsa_v0_1_2.fits')

In [46]:
table[:2]

IAUNAME,SUBDIR,RA,DEC,ISDSS,INED,ISIXDF,IALFALFA,IZCAT,ITWODF,MAG,Z,ZSRC,SIZE,RUN,CAMCOL,FIELD,RERUN,XPOS,YPOS,ZLG,ZDIST,ZDIST_ERR,NSAID,NMGY,NMGY_IVAR,OK,RNMGY,ABSMAG,AMIVAR,EXTINCTION,KCORRECT,KCOEFF,MTOL,B300,B1000,METS,MASS,XCEN,YCEN,NPROF,PROFMEAN,PROFMEAN_IVAR,QSTOKES,USTOKES,BASTOKES,PHISTOKES,PETROFLUX,PETROFLUX_IVAR,FIBERFLUX,FIBERFLUX_IVAR,BA50,PHI50,BA90,PHI90,SERSICFLUX,SERSICFLUX_IVAR,SERSIC_N,SERSIC_BA,SERSIC_PHI,ASYMMETRY,CLUMPY,DFLAGS,AID,PID,DVERSION,PROFTHETA,PETROTHETA,PETROTH50,PETROTH90,SERSIC_TH50,OBJNO,PLATE,FIBERID,MJD,COEFF,VDISP,D4000,D4000ERR,FA,FAERR,S2FLUX,S2FLUXERR,S2EW,S2EWERR,S2VMEAS,S2VMERR,S2RATIO,HAFLUX,HAFLUXERR,HAEW,HAEWERR,HAVMEAS,HAVMERR,N2FLUX,N2FLUXERR,N2EW,N2EWERR,N2VMEAS,N2VMERR,HBFLUX,HBFLUXERR,HBEW,HBEWERR,HBVMEAS,HBVMERR,O1FLUX,O1FLUXERR,O1EW,O1EWERR,O1VMEAS,O1VMERR,O2FLUX,O2FLUXERR,O2EW,O2EWERR,O2VMEAS,O2VMERR,O3FLUX,O3FLUXERR,O3EW,O3EWERR,O3VMEAS,O3VMERR,AHGEW,AHGEWERR,AHDEW,AHDEWERR,NE3EW,NE3EWERR,NE5EW,NE5EWERR,AV,S2NSAMP,RACAT,DECCAT,ZSDSSLINE,SURVEY,PROGRAMNAME,PLATEQUALITY,TILE,PLUG_RA,PLUG_DEC
bytes19,bytes27,float64,float64,int32,int32,int32,int32,int32,int32,float32,float32,bytes7,float32,int16,uint8,int16,bytes3,float32,float32,float32,float32,float32,int32,float32[7],float32[7],int16,float32[7],float32[7],float32[7],float32[7],float32[7],float32[5],float32[7],float32,float32,float32,float32,float64,float64,float32[7],"float32[15,7]","float32[15,7]","float32[15,7]","float32[15,7]","float32[15,7]","float32[15,7]",float32[7],float32[7],float32[7],float32[7],float32,float32,float32,float32,float32[7],float32[7],float32,float32,float32,float32[7],float32[7],int32[7],int32,int32,bytes8,float32[15],float32,float32,float32,float32,int32,int32,int32,int32,float32[7],float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float32,float64,float64,float32,bytes6,bytes23,bytes8,int32,float64,float64
J094651.40-010228.5,09h/m00/J094651.40-010228.5,146.71420878660933,-1.0412815695754143,0,72212,21157,-1,-1,-1,15.178774,0.021222278,sdss,0.07,756,1,206,301,136.29353,1095.152,0.020597626,0.020687785,0.00044536332,0,29.552263 .. 3251.03,0.30069932 .. 0.00012297294,1,31.202734 .. 3454.6162,-15.1673565 .. -20.296247,222.77438 .. 1102.5615,0.4536473 .. 0.080934174,-0.006074993 .. 0.019238615,0.00017146798 .. 2.5712407e-12,0.00020710815 .. 0.99210894,2.7482898e-05,0.30352896,0.034775626,8815395000.0,203.3387756347656,266.452880859375,0.3149959 .. 22.908411,0.3149959 .. 0.0,233.37904 .. 0.0,0.06221572 .. -0.057742715,0.00061461644 .. -0.13205947,0.88285136 .. 0.7480506,0.28299746 .. -56.808624,18.20807 .. 2266.657,1.9946122 .. 0.015632946,1.0260136 .. 563.82434,47.339676 .. 0.3339262,0.8845921,15.40155,0.8016074,17.53331,19.459507 .. 3130.742,1.0776075 .. 0.01723353,4.7782097,0.6645174,16.040314,-0.035117492 .. 0.007748138,0.05221656 .. 0.059254237,0 .. 0,0,30,v2_1_5,0.22341923 .. 258.39,7.2478933,3.4641922,10.453795,5.845999,26600130,266,1,51630,4.1058846 .. 1.12e-43,131.86357,1.6750895,0.01734476,0.086800694,0.0124226315,269.2588,17.046213,2.474215,0.23574898,123.141365,7.7632475,0.68473977,584.5531,13.858368,5.2918415,0.2084552,141.25987,5.8739424,294.05807,13.240504,2.6620486,0.16082057,128.46378,9.59668,147.91853,15.076868,1.5249467,0.12158316,122.40449,14.56363,22.092758,12.022614,0.20646292,0.13632269,103.73506,30.22204,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,112.33277,12.699974,1.195764,0.17837653,125.590454,9.01761,4.145629,0.39556292,-0.8991989,0.43246344,0.009395278,0.5594772,-9999.0,-9999.0,0.96117455,35.6447,146.7141910489815,-1.0412763864732957,0.021222278,sdss,legacy,good,122,146.71421,-1.0413043
J094631.60-005917.7,09h/m00/J094631.60-005917.7,146.63173520920705,-0.9883548584122686,1,72132,-1,-1,-1,-1,18.12335,0.05265425,sdss,0.07,756,1,206,301,618.0553,344.6873,0.052030597,0.052030597,0.0005003469,1,5.937189 .. 117.10686,2.7116714 .. 0.04883515,1,5.818742 .. 117.17861,-15.458004 .. -18.714317,81.08679 .. 568.131,0.43238893 .. 0.07714152,-0.025642741 .. -0.0068317885,3.529478e-11 .. 4.7585988e-07,3.0668558e-05 .. 0.8125395,0.023327291,0.06884753,0.020664403,1583589900.0,79.9375,89.643310546875,0.15526375 .. 0.8952112,0.15526375 .. 0.0,454.9598 .. 0.0,-0.006554106 .. -0.11334504,0.00026806968 .. -0.032187585,0.9869664 .. 0.789186,88.828926 .. -82.073296,3.443978 .. 137.51616,8.568999 .. 0.018935598,0.45930192 .. 31.892025,115.095764 .. 0.68445855,0.7651253,161.06845,0.59412754,162.74481,3.986808 .. 113.16854,6.9103274 .. 0.107008755,0.91950214,0.45471346,164.1058,0.018470645 .. -0.0010265708,0.32469216 .. -0.016755164,0 .. 0,0,36,v2_1_5,0.22341923 .. 258.39,4.7698913,2.2897646,5.2029705,2.8038216,26600630,266,6,51630,1.1e-44 .. 0.0008093229,41.803913,1.2417903,0.03935879,0.4228338,0.06334799,79.63688,4.6382565,10.780762,1.0136558,53.696716,3.4870718,0.84784895,158.23763,4.2205896,21.083307,1.1031191,51.533344,2.0600708,33.259502,3.634611,4.4314384,0.6746483,45.62845,6.644905,47.451176,6.3951616,4.9146857,0.5737447,56.25411,8.379909,7.674392,3.2292523,0.97784686,0.50509477,44.255222,28.346481,111.57335,13.896584,15.15305,3.610634,45.99403,9.92913,40.89481,5.2624846,4.2584043,0.76661664,25.959152,6.670042,-1.3966367,1.3962288,-3.96047,1.4181765,3.0781476,1.7761708,-9999.0,-9999.0,0.46189722,7.586127,146.6316733254659,-0.9882606225727444,0.05265425,sdss,legacy,good,122,146.63167,-0.98827781


As we can see, the `Table` read method has parsed that record array and created something similar to the `pandas` display we saw above: column headers with names, and a cleaner interface for viewing the data. We can now filter, if needed. Let's find all sources within a narrow redshift ($z$) range.

In [50]:
filtered = table[(table['Z']>0.02)&(table['Z']<0.021)]
len(filtered)

2070

Don't worry about the syntax of that filtering yet if you're unfamiliar with it, but an English translation might be "set the variable `filtered` to the table, indexed for rows such that the column `table['Z']` has a value greater than 0.02 and also such that the column `table['Z']` has a value less than 0.021."

You might be wondering when to use `pandas` vs `astropy.table.Table` objects to work with your data. In truth, you can convert between them fairly easily, so whichever you are more comfortable with will ultimately be fine.

:::{note}
One caveat to this is that `pandas` cannot handle "cells" with multi-dimensional values, e.g., a row-column position can't have a value of a whole array. Some astronomical tables, *because* `astropy` `Table` objects can handle this, will store data that way. The NASA-Sloan Atlas (NSA) is an example. Often, however, we only need a few columns, so whichever method we use to read the data in, we can extract those columns, create our own table, and move on.
:::

## ASDF (Advanced Science Data Format)

One new and up-and-coming data storage option is the `asdf` format. It is convenient because it stores data in a tree-like structure that, from the outside, looks exactly like a Python dictionary. This means that when we want to take some data we've been working with and save it to a file, instead of trying to format everything into columns and writing an ASCII file, we can simply store things by column name in a dictionary and then drop it straight into the file:

In [None]:
import asdf
from asdf import AsdfFile
import numpy as np

tree = {
    'a': np.arange(0, 10),
    'b': np.arange(10, 20)
}

target = AsdfFile(tree)
target.write_to('target.asdf')

As we can see, any arrays we are working with we can name, drop into the tree dictionary, and ultimately write to the file. We can also add metadata -- either directly into the tree, or into a metadata dictionary which then gets added to the tree. Opening an `asdf` file is also simple:

In [None]:
with asdf.open('target.asdf') as af:
    tree = af.tree

This `tree` attribute contains the dictionary we saved above, so everything pops right back out.

There's a lot more to `asdf` in terms of motivation and capability, but for our purposes at present, the important part is knowing how to pack and save, then open and unpack, data with this format. 

## HDF5 (Hierarchical Data Format 5)

The `hdf5` format is another new, semi-popular file standard. It structures its interior like a full file directory system, with "folders" and "files". To deal with `hdf5` files in Python, we need to use the `h5py` package, which handles the I/O operation for us. 

In [None]:
import h5py

f = h5py.File('example.hdf5', 'r') 
data_keys = list(f.keys())

From here, things will change depending on the file --- HDF5 stores things in "datasets," whose names will appear in that list of keys we just accessed. Assuming our example file key list had "dataset1" as one of the keys, we would then have

In [None]:
dataset1 = f['dataset1']

This dataset would be an object similar to, but not exactly, a `numpy` array. This object can be indexed and sliced like a `numpy` array. For many frameworks, however, it is beneficial to extract and convert these datasets into `numpy` before continuing further.
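Since the example above assumes a file already exists, here is a self-contained sketch that first writes a small HDF5 file and then reads it back. The dataset and group names (`dataset1`, `images`, `frame0`) and the file location are made up for illustration.

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), 'example.hdf5')

# Write: a dataset at the top level, plus a "folder" (group) holding another.
with h5py.File(path, 'w') as f:
    f.create_dataset('dataset1', data=np.arange(10))
    grp = f.create_group('images')
    grp.create_dataset('frame0', data=np.zeros((3, 3)))

# Read: list the top-level names, then slice and convert the datasets.
with h5py.File(path, 'r') as f:
    top_level = list(f.keys())
    dataset1 = f['dataset1']
    first_three = dataset1[:3]          # slicing returns a numpy array
    as_array = dataset1[...]            # [...] reads the whole dataset
    frame = f['images/frame0'][...]     # groups are addressed like paths
```

Using `h5py.File` as a context manager closes the file for us, which is why the datasets are converted to `numpy` arrays before leaving the `with` block.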

:::{tip}
More on datasets can be found here: https://docs.h5py.org/en/stable/high/dataset.html#dataset
:::

