# Python Packages
This Jupyter notebook continues from the previous one, *intro_to_python.ipynb*.

# 3. Plotting
By far the most common package used for plotting data is matplotlib (portmanteau of: MATLAB-plot-library).

In particular, you will use the functionality contained in the pyplot module:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

To plot a line, simply call the plot() function

In [None]:
x = np.arange(0, 2*np.pi, 0.01)
y = np.sin(x)

plt.plot(x, y)

You can have multiple datasets on the same plot

In [None]:
y_cos = np.cos(x)

plt.plot(x, y)
plt.plot(x, y_cos)

Colours can be chosen too

In [None]:
plt.plot(x, y, c='k')
plt.plot(x, y_cos, c='b')
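
As a small sketch of the colour options: `c` is shorthand for the `color` keyword, and a colour can be given as a single-letter code, a named colour, or a hex string. The `label` and `legend` calls and the third line are additions for illustration, not part of the cell above.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2*np.pi, 0.01)

plt.figure()
# plot() returns a list of lines; the trailing comma unpacks the single entry
line_sin, = plt.plot(x, np.sin(x), color='k', label='sin(x)')            # single-letter code
line_cos, = plt.plot(x, np.cos(x), color='tab:orange', label='cos(x)')   # named colour
line_neg, = plt.plot(x, -np.sin(x), color='#1f77b4', linestyle='--', label='-sin(x)')  # hex string
plt.legend()  # uses the label of each line
```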

If you want to set up multiple figures, you can do this by defining a figure before plotting

In [None]:
plt.figure()
plt.plot(x, y)

plt.figure()
plt.plot(x, y_cos)

You can also refer to a previous figure if you know the figure number

In [None]:
y_tan = np.tan(x)

plt.figure(1)
plt.plot(x, y)

plt.figure(2)
plt.plot(x, y_tan)

plt.figure(1)
plt.plot(x, y_cos)
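
A minimal sketch of how figure numbers behave: `plt.figure(n)` creates figure `n` if it does not exist yet, and otherwise makes the existing figure current so that later `plot()` calls land on it. In a notebook with the inline backend, figures are typically closed once a cell has been displayed, so this is most reliable within a single cell.

```python
import numpy as np
import matplotlib.pyplot as plt

plt.close('all')  # start from a clean slate so the numbering is predictable
x = np.arange(0, 2*np.pi, 0.01)

fig1 = plt.figure(1)
plt.plot(x, np.sin(x))

fig2 = plt.figure(2)
plt.plot(x, np.tan(x))

same_fig1 = plt.figure(1)   # no new figure: figure 1 becomes current again
plt.plot(x, np.cos(x))      # drawn on figure 1, alongside sin(x)

open_figures = plt.get_fignums()                 # numbers of all open figures
lines_on_fig1 = len(fig1.axes[0].get_lines())
lines_on_fig2 = len(fig2.axes[0].get_lines())
```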

Limit the plot visually:

In [None]:
plt.figure()
plt.plot(x, y_tan)
plt.xlim(0, 3)
plt.ylim(-10, 10)

The same works for scatter plots too

In [None]:
plt.figure()
plt.scatter(x, y_tan, s=1) # pass s to change the scatter size
plt.xlim(0, 3)
plt.ylim(-10, 10)
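
Note that `xlim` and `ylim` only change the view; the data are untouched. For a function like tan(x), `plot()` also joins the points on either side of each asymptote with a near-vertical line. One common workaround, sketched here with an arbitrary threshold of 10, is to replace large values with NaN, which matplotlib skips when drawing lines.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2*np.pi, 0.01)
y_tan = np.tan(x)

# Copy the data and blank out values beyond the range we intend to show.
# The threshold of 10 is an arbitrary choice matching the y-limits below.
y_clipped = y_tan.copy()
y_clipped[np.abs(y_clipped) > 10] = np.nan

plt.figure()
plt.plot(x, y_clipped)   # the line breaks wherever there is a NaN
plt.xlim(0, 3)
plt.ylim(-10, 10)

# Calling xlim()/ylim() with no arguments returns the current limits
view_limits = (plt.xlim(), plt.ylim())
```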

Matplotlib allows for multiple subplots (axes) within a single figure

In [None]:
fig, ax = plt.subplots(1, 2, sharey=True) # number of rows, number of columns
ax[0].plot(x, y)
ax[1].plot(x, y_cos)

In [None]:
fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
ax[0, 0].plot(x, y) # indexing multiple dimensions
ax[1, 0].plot(x, y_cos)
ax[0, 1].plot(x, y_cos)
ax[1, 1].plot(x, y)
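
`plt.subplots` returns the figure plus a numpy array of Axes objects whose shape matches the grid, which is why the 1-by-2 case takes one index and the 2-by-2 case takes two. A short sketch, with `set_title`, `ax.flat` and `tight_layout` added here for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 2*np.pi, 0.01)

fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
grid_shape = ax.shape  # (rows, columns)

# ax.flat walks the grid row by row, which saves writing out every index pair
functions = [np.sin, np.cos, np.cos, np.sin]
for axis, func in zip(ax.flat, functions):
    axis.plot(x, func(x))
    axis.set_title(func.__name__)

fig.tight_layout()  # reduce overlap between titles and neighbouring panels
```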

## Seaborn
Seaborn is a plotting package built on top of matplotlib that simplifies some of the more complex plots. I recommend reading the docs and learning it, but everything seaborn does is possible in base matplotlib if you try hard enough.

# 4. Opening FITS files
The best ways to open FITS files are with either astropy (very useful for other functions too!) or fitsio. I prefer fitsio because it's faster and more configurable, but both work very well.

Download GALAH DR4 [here](https://cloud.datacentral.org.au/teamdata/GALAH/public/GALAH_DR4/catalogs/) (make sure to get galah_dr4_allstar_240705.fits), and put it in the folder containing this Jupyter Notebook.
(If clicking the link on datacentral doesn't work, try right-clicking and choosing "Save as".)

In [None]:
import fitsio

# We will open GALAH DR4 - this is a big-ish file
galah_dr4_file = 'galah_dr4_allstar_240705.fits'
hdul = fitsio.FITS(galah_dr4_file)
print(hdul)

We can index the FITS file either by position ([0], [1], etc., matching the extnum printed above) or by name, using the string listed under hduname

In [None]:
print(hdul[0])
print(hdul[1])

We can read in the data using .read()

If the data is a binary table, we can specify rows and columns, which helps when the FITS file is very large and we don't want to use all our RAM

In [None]:
data_full = hdul[1].read()
# rows and columns are passed as lists; slice notation such as 1:23 is not accepted as an argument
data_small = hdul[1].read(rows=[1, 2, 3, 100], columns=['teff', 'logg', 'fe_h'])
print(len(data_full))
print(data_full)
print(len(data_small))
print(data_small) # Note the data is read in as a numpy array

hdul.close() # close the file once we're done with it

For more examples of how to use fitsio, see the [fitsio GitHub repository](https://github.com/esheldon/fitsio)
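
The array returned by `.read()` on a table is a numpy structured array: each row is a record and each column is a named field. The sketch below builds a small stand-in array by hand (the values are invented) to show access patterns that should carry over to `data_full`.

```python
import numpy as np

# Stand-in for a table read from FITS; values are invented for illustration
data = np.array(
    [(5770.0, 4.44, 0.00), (4500.0, 2.50, -1.20), (6100.0, 4.10, -0.30)],
    dtype=[('teff', 'f8'), ('logg', 'f8'), ('fe_h', 'f8')],
)

column_names = data.dtype.names      # ('teff', 'logg', 'fe_h')
teff = data['teff']                  # one column, as an ordinary 1D array
first_row = data[0]                  # one row, as a single record
subset = data[['teff', 'fe_h']]      # several columns at once
cool = data[data['teff'] < 5000]     # rows selected with a boolean mask
```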

Good practice when opening files is to use a *with* block, which ensures the file is closed safely once we are done, so we don't end up holding on to more than we need

In [None]:
with fitsio.FITS(galah_dr4_file) as hdul:
    data_full = hdul[1].read()
    print(hdul)
# Leaving the with block closes the file safely; anything we stored (such as data_full) stays in memory
# The name hdul still exists, but the file behind it is closed, so calling print(hdul) here is expected to fail
print(data_full)
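
The same pattern applies to any file-like object. As a sketch, an ordinary text file (written to a temporary, made-up path) shows what happens at the end of a *with* block: the file is closed, and further reads fail.

```python
import os
import tempfile

# Write a small throwaway file to read back
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'example.txt')
with open(path, 'w') as f:
    f.write('teff logg fe_h\n')

with open(path) as f:
    header = f.readline()        # keep whatever we need while the file is open
    open_inside = not f.closed

closed_after = f.closed          # the name f is still bound, but the file is closed

try:
    f.readline()
    read_failed = False
except ValueError:               # reading from a closed file raises ValueError
    read_failed = True
```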

# 5. Data manipulation

## Pandas
Pandas DataFrames are a simple way to do advanced data manipulation. Personally I prefer pandas over numpy arrays, but sometimes plain arrays are the better choice, particularly for large datasets, where converting to pandas and back becomes slow.

We can make a DataFrame from a dictionary, or from other structures like numpy arrays

In [None]:
import pandas as pd
galah_df = pd.DataFrame(data_full)
print(galah_df)
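
A small sketch of both construction routes, plus the conversion back to numpy. The values are invented; the column names simply mirror the GALAH ones.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names
df_from_dict = pd.DataFrame({
    'teff': [5770.0, 4500.0, 6100.0],
    'logg': [4.44, 2.50, 4.10],
    'fe_h': [0.00, -1.20, -0.30],
})

# From a numpy structured array: field names become column names
records = np.array(
    [(5770.0, 4.44, 0.00), (4500.0, 2.50, -1.20), (6100.0, 4.10, -0.30)],
    dtype=[('teff', 'f8'), ('logg', 'f8'), ('fe_h', 'f8')],
)
df_from_array = pd.DataFrame(records)

# And back to numpy when plain arrays are more convenient
teff_array = df_from_dict['teff'].to_numpy()   # one column -> 1D array
all_values = df_from_dict.to_numpy()           # whole table -> 2D array
```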

We can probe the dataframe easily, for example checking the column names

In [None]:
print(galah_df.columns)
for i in galah_df.columns:
    print(i)

We can check what's in a column by indexing with the column name

In [None]:
fe_h = galah_df['fe_h']
print(fe_h)

We can't just index with numbers by default: square brackets look up column labels, so the cell below raises a KeyError

In [None]:
print(galah_df[3])

Instead, we can use the .loc (label-based) and .iloc (integer position-based) indexers

In [None]:
print(galah_df.loc[1:3, 'fe_h'])  # .loc selects by label, and label slices include the end point;
                                  # the row labels are integers by default, so this returns rows 1, 2 and 3
print(galah_df.iloc[1:3, 34])     # .iloc selects by position with normal Python slicing, so the end is excluded
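
The difference is easier to see when the row labels are not simply 0, 1, 2, and so on. A minimal sketch with an invented table and made-up labels:

```python
import pandas as pd

# Small stand-in table with non-default row labels (values are invented)
df = pd.DataFrame(
    {'teff': [5770.0, 4500.0, 6100.0, 5200.0],
     'fe_h': [0.00, -1.20, -0.30, -2.10]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[20:40, 'fe_h']     # labels 20, 30 and 40: the end label is included
by_position = df.iloc[1:3, 1]        # positions 1 and 2: the end position is excluded

# Plain square brackets look for a column label, so an integer raises KeyError
try:
    df[3]
    raised = False
except KeyError:
    raised = True

# Looking up a column's position by name avoids hard-coding a number for iloc
fe_h_position = df.columns.get_loc('fe_h')
```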

One of the most useful features that pandas DataFrames handle well is masking

In [None]:
# First, let's plot a Kiel diagram of the entire GALAH dataset
plt.figure()
# c can take an array to colour the points by; vmin and vmax set the colourbar limits
plt.scatter(galah_df['teff'], galah_df['logg'], c=galah_df['fe_h'], s=0.5, vmin=-4, vmax=0.5)
plt.ylim(-1, 5)
plt.xlim(3000, 8000)
plt.gca().invert_xaxis()
plt.gca().invert_yaxis()
plt.colorbar()

What if we only care about some metal-poor stars, for example?

In [None]:
galah_metal_poor = galah_df[galah_df['fe_h'] < -2] # Here, we are indexing the dataframe using a boolean array
# This line is equivalent to
mask = galah_df['fe_h'] < -2
galah_metal_poor = galah_df[mask]
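
To see what the mask actually contains, here is a sketch on a four-row stand-in table (values invented). The `reset_index` call is an optional extra, not something the cells here rely on.

```python
import pandas as pd

df = pd.DataFrame({
    'teff': [5770.0, 4500.0, 6100.0, 5200.0],
    'fe_h': [0.00, -2.40, -0.30, -2.10],
})

mask = df['fe_h'] < -2          # boolean Series, one True/False per row
n_selected = int(mask.sum())    # True counts as 1, so the sum is the number of matches

metal_poor = df[mask]                   # keeps only the rows where the mask is True
kept_labels = list(metal_poor.index)    # the original row labels are preserved

# reset_index(drop=True) renumbers from 0 if the old labels are not needed
renumbered = metal_poor.reset_index(drop=True)
```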

In [None]:
# Plot again
plt.figure()
plt.scatter(galah_metal_poor['teff'], galah_metal_poor['logg'], c=galah_metal_poor['fe_h'], s=0.5, vmin=-4, vmax=0.5) # same colour scale as before
plt.ylim(-1, 5)
plt.xlim(3000, 8000)
plt.gca().invert_xaxis()
plt.gca().invert_yaxis()
plt.colorbar()

This way of masking will reduce the entire DataFrame, and preserve its order - which makes it very useful for filtering by one condition and checking another (like in our figure)

We can also have multiple conditions

In [None]:
mask = (galah_df['fe_h'] < -1) & (galah_df['logg'] < 3.5) # We use "&" for "and" and "|" for "or"
# when combining conditions, each one needs to be wrapped in its own brackets
galah_cut = galah_df[mask]

# Plot again
plt.figure()
plt.scatter(galah_cut['teff'], galah_cut['logg'], c=galah_cut['fe_h'], s=0.5, vmin=-4, vmax=0.5) # same colour scale as before
plt.ylim(-1, 5)
plt.xlim(3000, 8000)
plt.gca().invert_xaxis()
plt.gca().invert_yaxis()
plt.colorbar()
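
A sketch of the three operators on a small invented table, including what goes wrong without the brackets. The "~" (not) operator is an addition here; it negates a mask.

```python
import pandas as pd

df = pd.DataFrame({
    'logg': [4.44, 2.50, 4.10, 1.80],
    'fe_h': [0.00, -1.20, -1.50, -0.30],
})

both = df[(df['fe_h'] < -1) & (df['logg'] < 3.5)]     # metal-poor AND low logg
either = df[(df['fe_h'] < -1) | (df['logg'] < 3.5)]   # metal-poor OR low logg
not_poor = df[~(df['fe_h'] < -1)]                     # NOT metal-poor

# "&" binds more tightly than "<", so without brackets Python evaluates
# -1 & df['logg'] first and the expression fails instead of combining the masks
try:
    df[df['fe_h'] < -1 & df['logg'] < 3.5]
    failed = False
except (TypeError, ValueError):
    failed = True
```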

This should cover the basics and allow you to start playing around with data in Python!

For any more specific use cases, check the library or package documentation online.