This notebook contains various examples of how to read in tabulated data in Python.

Date Created: Fall 2016
<br>
Last Modified: Feb 5 2017 
<br>
Humans Responsible: The Prickly Pythons

In [None]:
%matplotlib inline

# 1. Read data files in different formats in Python

## 1.0 Starting with ASCII files!
See links below for more information: 
<br>
https://docs.python.org/3/howto/unicode.html
<br>
https://stackoverflow.com/questions/2241348/what-is-unicode-utf-8-utf-16

In [None]:
chr(65)

In [None]:
chr(0b01000001)

In [None]:
chr(0x41)

In [None]:
ord('A')

In [None]:
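# Code points (in hexadecimal) for the characters of a short phrase
aphrase = [0x4A, 0x65, 0x67, 0x20, 0x6C,
           0xE6, 0x72, 0x65, 0x72, 0x20,
           0x50, 0x79, 0x74, 0x68, 0x6F,
           0x6E, 0x2E]

# Convert each code point to its character and print the phrase on one line
print(''.join(chr(code) for code in aphrase))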

In [None]:
ord('æ')

In [None]:
chr(230)

In [None]:
chr(0x1F603)
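
The code points above are abstract numbers; how they end up as bytes in a file depends on the encoding. As a minimal sketch (the variable names are only illustrative), the cell below encodes the same phrase as UTF-8 and shows that `æ` takes two bytes while each ASCII character takes one:

```python
# Illustrative sketch: code points versus UTF-8 bytes
phrase = 'Jeg lærer Python.'

# Encode the string into UTF-8 bytes, as it would be stored in a text file
encoded = phrase.encode('utf-8')

# 17 characters but 18 bytes: 'æ' (code point 230) needs two bytes in UTF-8
print(len(phrase), len(encoded))
print('æ'.encode('utf-8'))   # b'\xc3\xa6'

# Decoding with the same encoding recovers the original string
decoded = encoded.decode('utf-8')
print(decoded == phrase)
```

This is why a file has to be read with the same encoding it was written in.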

## 1.1 Reading data in using numpy.loadtxt()
Docs: http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

In [None]:
# In test_data/ there is a text file called spectrum.dat 
# with data that we want to load into python. 
# (spectrum.dat is a model stellar spectrum from starburst99 for 
# a group of stars with 0.7 x solar metallicity, 
# 1e4 solar masses population, Kroupa IMF and a starburst 1e6 years ago).

In [None]:
import numpy as np
spec_nparray = np.loadtxt('test_data/spectrum.dat', skiprows=6)

print(type(spec_nparray))

In [None]:
# Shape of this numpy array will be determined by number of columns and rows in your data:
print(spec_nparray.shape)

In [None]:
# And if you want to extract e.g. the column containing wavelength data, 
# you need to remember its column index, in this case 1:
wavelength_A = spec_nparray[:,1]

print(wavelength_A)

In [None]:
# By default, numbers are loaded as 64-bit floats:
print(wavelength_A.dtype)
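
If you only need some of the columns, `loadtxt()` can pick them out while reading. The sketch below uses a small made-up table held in memory (via `io.StringIO`) instead of spectrum.dat, so the column layout here is an assumption for illustration only:

```python
import io
import numpy as np

# Hypothetical stand-in for a data file: two comment lines, then three columns
text = """# made-up example table
# t_yr wavelength_A L_tot
1e6 91.0 30.1
1e6 94.0 30.5
1e6 96.0 31.2
"""

# Lines starting with '#' are treated as comments and skipped by default.
# usecols selects columns by index; unpack=True returns one array per column.
wavelength, l_tot = np.loadtxt(io.StringIO(text), usecols=(1, 2), unpack=True)

print(wavelength)     # the wavelength column only
print(l_tot.dtype)    # float64 by default
```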

## 1.2 Reading data in using numpy.genfromtxt()
Doc: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

The `genfromtxt()` function from numpy is more flexible: it can assign names to columns and handle missing or malformed values.

In [None]:
spec_nparray2 = np.genfromtxt('test_data/spectrum.dat', skip_header=6, \
                              names=['t_yr','wavelength_A','L_tot','L_stellar','L_nebular'])
print(type(spec_nparray2))

In [None]:
# Print the eighth value (index 7) in the wavelength column.
print(spec_nparray[7,1])
print(spec_nparray2['wavelength_A'][7])

In [None]:
# Try to change one of the wavelengths into something that is not a number (like %%%) 
# and you will see that genfromtxt() can handle this if you specify the keywords:
# missing_values='%%%', filling_values=desired_value
spec_nparray2 = np.genfromtxt('test_data/spectrum_nan.dat', skip_header=6,\
                              names=['t_yr','wavelength_A','L_tot','L_stellar','L_nebular'],\
                              missing_values='%%%', filling_values=np.nan)

print(spec_nparray2['wavelength_A'][0])

In [None]:
# But loadtxt() cannot handle the non-numeric entry and raises a ValueError:
spec_nparray = np.loadtxt('test_data/spectrum_nan.dat', skiprows=6)
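
To see the difference without editing a file on disk, here is a minimal sketch on a made-up three-row table held in memory. The table contents and the shortened column list are assumptions for illustration, not the contents of spectrum_nan.dat:

```python
import io
import numpy as np

# Hypothetical stand-in for a data file with one corrupted entry ('%%%')
text = """1e6 %%% 30.1
1e6 94.0 30.5
1e6 96.0 31.2
"""
names = ['t_yr', 'wavelength_A', 'L_tot']

# genfromtxt() replaces the flagged entry with the filling value
data = np.genfromtxt(io.StringIO(text), names=names,
                     missing_values='%%%', filling_values=np.nan)
print(data['wavelength_A'])

# loadtxt() has no such option and raises a ValueError instead
try:
    np.loadtxt(io.StringIO(text))
    loadtxt_failed = False
except ValueError as err:
    loadtxt_failed = True
    print('loadtxt failed:', err)
```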

## 1.3 Read data into Pandas dataframe
Often a more convenient approach for tabular data is to load it directly into a Pandas DataFrame. The function `read_table` (almost identical to `read_csv`, but with tab as the default separator) can be used to read an ASCII file into a DataFrame:<br>
http://pandas.pydata.org/pandas-docs/stable/dsintro.html<br>
Some useful Pandas features that can be applied to a DataFrame:<br>
http://dataconomy.com/2015/03/14-best-python-pandas-features/

In [None]:
import pandas as pd

names=['t_yr','wavelength_A','L_tot','L_stellar','L_nebular']
spec_dataframe = pd.read_table('test_data/spectrum.dat',
                               names=names,
                               skiprows=6,
                               sep=r"\s+",   # one or more whitespace characters
                               engine='python')
print(type(spec_dataframe))

Note about `engine`: the C engine is faster, while the Python engine is currently more feature-complete. If `sep=None`, the C engine cannot detect the separator automatically, so the Python engine is used instead and detects it with Python's built-in sniffer tool, `csv.Sniffer`. In addition, separators longer than one character and different from `'\s+'` are interpreted as regular expressions and also force the use of the Python engine.

In [None]:
spec_dataframe['t_yr'][1]
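
The same kind of read can be tried on a small made-up table held in memory. This sketch uses `read_csv`, which accepts the same arguments as `read_table`; the table contents and the shortened column list are assumptions for illustration:

```python
import io
import pandas as pd

# Hypothetical stand-in for a whitespace-separated file with one header line
text = """made-up header line to skip
1e6   91.0   30.1
1e6   94.0   30.5
1e6   96.0   31.2
"""
names = ['t_yr', 'wavelength_A', 'L_tot']

# sep=r"\s+" splits on any run of whitespace
df = pd.read_csv(io.StringIO(text), names=names, skiprows=1, sep=r"\s+")

# Columns are selected by name; rows by position with .iloc
print(df['wavelength_A'].iloc[1])
print(df.dtypes)
```

Selecting rows with `.iloc` avoids any ambiguity between positions and index labels.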

In [None]:
# Plot spectrum
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['xtick.labelsize'] = 15
mpl.rcParams['ytick.labelsize'] = 15

fig          =   plt.figure(0, figsize=(10,5))
ax1          =   fig.add_axes([0.15,0.1,0.75,0.8])
ax1.set_ylim(31,39)
ax1.set_xlim(1e2,1e6)
ax1.set_xscale('log')
ax1.set_xlabel('Wavelength [AA]', fontsize=15)
ax1.set_ylabel('log flux [erg/s/AA]', fontsize=15)
ax1.set_title(r'1e4 M$_{\odot}$ of stars with Z=0.008 after 1e6 yr', fontsize=15)
#ax1.plot(spec_nparray[:,1],spec_nparray[:,2],'b')
#ax1.plot(spec_nparray2['wavelength_A'],spec_nparray2['L_tot'],'b')
ax1.plot(spec_dataframe['wavelength_A'], spec_dataframe['L_tot'],'b')

plt.show()

Pandas has a function that saves (serializes) a dataframe:

In [None]:
spec_dataframe.to_pickle('test_data/spec_dataframe_pickle') # no extension
load_spec_dataframe_pickle = pd.read_pickle('test_data/spec_dataframe_pickle')
load_spec_dataframe_pickle['t_yr'][0] # test

By default, the to_pickle function will use the highest "protocol" possible to save the dataframe in binary format: 

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_pickle.html

Protocol version 0 is the original ASCII protocol and is backwards compatible with earlier versions of Python.<br>
Protocol version 1 is the old binary format which is also compatible with earlier versions of Python.<br>
Protocol version 2 was introduced in Python 2.3. It provides much more efficient pickling of new-style classes.

To check what your default highest protocol is:

In [None]:
import pickle
pickle.HIGHEST_PROTOCOL

In the run recorded here this returned 2, so protocol=2 was used (newer Python versions report a higher value). Forcing protocol=0 results in a slightly larger data file:

In [None]:
spec_dataframe.to_pickle('test_data/spec_dataframe_pickle_p0', protocol=0) # no extension

Ultimately it depends a bit on what the dataframe contains, see a comparison here of different ways to save dataframes: 
http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization
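
To check the size difference yourself, here is a minimal sketch that pickles a made-up DataFrame of random numbers with protocol 0 and with the highest available protocol, then compares the file sizes. The DataFrame and file names are assumptions for illustration, and the exact sizes depend on your Python and pandas versions:

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Made-up DataFrame of random numbers, used only to compare file sizes
df = pd.DataFrame(np.random.rand(1000, 5), columns=list('abcde'))

with tempfile.TemporaryDirectory() as tmpdir:
    path_p0 = os.path.join(tmpdir, 'df_p0')
    path_hi = os.path.join(tmpdir, 'df_hi')

    df.to_pickle(path_p0, protocol=0)                        # ASCII protocol
    df.to_pickle(path_hi, protocol=pickle.HIGHEST_PROTOCOL)  # newest binary protocol

    size_p0 = os.path.getsize(path_p0)
    size_hi = os.path.getsize(path_hi)

    # Both files load back to the same DataFrame
    same = pd.read_pickle(path_p0).equals(pd.read_pickle(path_hi))

print(size_p0, size_hi, same)
```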

## 1.4 Reading FITS files with astropy

In test_data/ there is a file called cloud.fits with data that we want to import into python. 
(cloud.fits is a simulated HCO+ data cube of a cloud, calculated with the radiative transfer code LIME)

First we need the fits module from astropy.io to open the FITS file as an `HDUList` object:

In [None]:
from astropy.io import fits

fits_file = fits.open('test_data/cloud.fits')
print(type(fits_file))

(HDU = Header Data Unit)<br>
Next, we can get some basic info about the fits file:

In [None]:
fits_file.info()

And display all header "cards":

In [None]:
print(fits_file[0].header) 

Now we can extract the info that we're interested in like this:

In [None]:
imgres = fits_file[0].header['CDELT2']
print('Image resolution: %.6s degrees ' % imgres)
npix = fits_file[0].header['NAXIS1'] # NAXIS1 and NAXIS2 are the spatial axes; NAXIS3 counts the velocity channels
print('Number of pixels on each side: %s' % npix)
velres = fits_file[0].header['CDELT3']
print('Velocity resolution: %s m/s' % velres)

And we can change any of these parameters:

In [None]:
fits_file[0].header['CDELT2']=2.0
imgres = fits_file[0].header['CDELT2']
print('Image resolution: %.6s degrees ' % imgres)

The actual data is stored in the `data` attribute of the primary HDU, `fits_file[0]`:

In [None]:
HCO_flux = fits_file[0].data 
print(np.shape(HCO_flux))
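
When relating header keywords to the array shape, keep in mind that FITS numbers its axes in the reverse order of the numpy shape. The sketch below builds a made-up cube of zeros in memory (it does not read cloud.fits) to show which `NAXIS` keyword goes with which numpy axis:

```python
import numpy as np
from astropy.io import fits

# Made-up cube with 60 velocity channels of 100 x 100 pixels (zeros only)
cube = np.zeros((60, 100, 100))
hdu = fits.PrimaryHDU(data=cube)

# FITS counts axes in the reverse order of the numpy shape:
# NAXIS1 is the last numpy axis, NAXIS3 the first
print(hdu.data.shape)
print(hdu.header['NAXIS1'], hdu.header['NAXIS2'], hdu.header['NAXIS3'])
```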

In this case, I know that the 60x100x100 matrix is in the format [velocity channels, x axis, y axis], so we can create the moment 0 map as:

In [None]:
mom0 = HCO_flux.sum(axis=0)*velres/1000 # moment 0 map, Jy*km/s
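
The moment 0 calculation is easy to check by hand on a tiny made-up cube. In this sketch every voxel has a flux of 1 Jy and the channel width is an assumed 500 m/s, so each pixel should come out as 4 channels x 1 Jy x 0.5 km/s:

```python
import numpy as np

# Made-up cube: 4 velocity channels of 2 x 3 pixels, flux of 1 Jy everywhere
cube = np.ones((4, 2, 3))
velres_ms = 500.0   # assumed channel width in m/s

# Sum over the velocity axis (axis 0) and convert m/s to km/s
mom0_demo = cube.sum(axis=0) * velres_ms / 1000

print(mom0_demo.shape)   # one value per pixel
print(mom0_demo[0, 0])   # 4 channels * 1 Jy * 0.5 km/s
```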

And make a contour plot of this map:

In [None]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt

fig         =   plt.figure(1,figsize=(9,9))
ax1         =   fig.add_subplot(1,1,1)
ax1.set_xlabel("x ['']",fontsize=15)
ax1.set_ylabel("y ['']",fontsize=15)
ax1.set_title("Moment 0 map of HCO$^+$ gas cloud",fontsize=15)
x1 = imgres*(np.arange(npix)-npix/2) # image axis
xmax = max(x1)
im = ax1.imshow(mom0,interpolation='bilinear',origin='lower',\
                cmap=cm.hot,extent=(-xmax,xmax,-xmax,xmax),vmax=120)
# Add colorbar that matches image in height
from mpl_toolkits.axes_grid1 import make_axes_locatable
divider = make_axes_locatable(ax1)
cax = divider.append_axes("right", size="5%", pad=0.05)
cbar = plt.colorbar(im,cax=cax)
cbar.set_label('Jy km/s',size=20)
plt.show(block=False)

# 2. Saving data for later with numpy
Docs: 
<br>
https://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
<br>
https://docs.scipy.org/doc/numpy/reference/generated/numpy.load.html

In [None]:
# Say you have a numpy array that you want to save to a file and load later. 
# One way to do so is with numpy:
np.save('test_data/spec_nparray', spec_nparray) # will get a '.npy' extension

In [None]:
# Test - using numpy.load()
load_spec_nparray = np.load('test_data/spec_nparray.npy')
load_spec_nparray[7,1]
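
If you have several arrays that belong together, `np.savez()` stores them in a single `.npz` archive. A minimal sketch, with made-up arrays and a temporary directory standing in for test_data/:

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for two columns of a spectrum
wavelength = np.array([91.0, 94.0, 96.0])
l_tot = np.array([30.1, 30.5, 31.2])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'spectrum_columns.npz')

    # Keyword names become the keys of the archive
    np.savez(path, wavelength=wavelength, l_tot=l_tot)

    # np.load returns a dict-like object for .npz files
    with np.load(path) as archive:
        keys = sorted(archive.files)
        loaded_wavelength = archive['wavelength']

print(keys)
print(np.array_equal(loaded_wavelength, wavelength))
```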

# 3. Pickling
Docs: 
<br>
https://docs.python.org/3/library/pickle.html
<br>
https://docs.python.org/2.3/lib/module-cPickle.html

In [None]:
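# You can also use pickle! In Python 2, cPickle is a faster C implementation of
# pickle; in Python 3, the standard pickle module uses the C implementation
# automatically, so fall back to it when cPickle is unavailable.
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle             # Python 3

# 'wb' is the file mode: open for writing in binary format
with open('test_data/spec_nparray_pickle', 'wb') as f:  # no extension
    pickle.dump(spec_nparray, f)
# 'rb' opens the file for reading in binary format
with open('test_data/spec_nparray_pickle', 'rb') as f:
    load_spec_nparray = pickle.load(f)
load_spec_nparray[7,1] # Test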

In [None]:
# But the to_pickle method is specific to pandas and will not work on, say, a numpy array (it raises an AttributeError):
spec_nparray.to_pickle('test_data/spec_dataframe_pickle')
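
The pickle module itself is not tied to one type: it works on most Python objects, including containers that mix numpy arrays with plain values. A minimal sketch with a made-up dictionary, using `dumps()`/`loads()` to go through a bytes object instead of a file:

```python
import pickle
import numpy as np

# Made-up container mixing a numpy array with plain Python objects
record = {'name': 'example spectrum',
          'wavelength_A': np.array([91.0, 94.0, 96.0]),
          'n_header_lines': 6}

# dumps()/loads() serialize to and from a bytes object instead of a file
payload = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(type(payload))
print(restored['name'], restored['wavelength_A'])
```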