# ARROW Python Activity 5.2 and 5.3 Hints, Tips, and Code Snippets


This notebook contains hints, tips and code snippets that you might find useful in completing Activities 5.2 and 5.3.

Many scientific data processing tasks follow a very similar workflow, something like:

1. Read in some data.
2. Process the data.
3. Optionally, display the data.
4. Write out the data - usually to a new file.

> **OPTIONAL:** For those of you who are more confident in Python coding, you could embed code that performs all the workflow steps inside a loop. You could then work out how to provide the program with a list of all your spectrum file names and process all of them in one go. If you do attempt this, don't use _Matplotlib_ to display anything. Use _Bokeh_ instead. For a number of technical reasons, _Matplotlib_ doesn't display multiple, sequential plots very well in Jupyter notebooks.

> **HINT:** We'll want to preserve the original header information from our file so that we can add it to our output file after we have processed the non-header data. You'll need to read in the spectrum header lines (there are 12 of them, all starting with '#') using ordinary _Python_ File I/O. Later, you'll use ordinary File I/O again to write the saved header to a new file, then **append** the data in the modified _Pandas_ `DataFrame` as comma-delimited data. Use the _Pandas_ **`.to_csv()`** method with `mode='a'`.
> You'll want to remember this procedure for later on during this Topic.

> **HINT:** Use _Pandas_ to read in, process and write out the main spectral data.

> **HINT:** You can display the spectrum using either _Matplotlib_ or _Bokeh_. _Matplotlib_ may seem easier for making quick plots, but _Bokeh_ will be much more useful later on.

> **HINT:** Review the `UsingPandas`, `UsingBokeh` and `FileIO` notebooks before you start.


## Step 0 - Imports and Functions

Import the modules and packages you'll need. You should be pretty confident with this now. We'll do the first obvious one, but the rest are up to you!


In [1]:
import pandas as pd
# Any others you might need

Now write any functions you might find useful later. This isn't strictly necessary, but it is good programming practice and will prove especially useful if you put the simple code into a loop later.

To get you started, we'll provide you with a function that performs the slightly awkward two-part process of reading in the data file:

1. Reading (and saving for later) the header lines.
2. Reading the actual data. 

You could, of course, just do this in a block at the start of the code without defining a reusable function.

> We strongly recommend that you at least try to understand what the function code is doing!


In [2]:
def read_ARROW_data(filename):
    """Read in and partially process an ARROW spectrum.

    The spectrum file starts with a number of header lines, which begin with
    '#', ',' or whitespace. This function separates them from the main data
    and returns both.

    Parameters
    ----------
    filename : str
        Name of the spectrum file

    Returns
    -------
    dat : pandas.DataFrame
        Spectrum data
    header_list : list of str
        List of header lines
    """

    # Read lines until the first one that does not start with '#', ',' or
    # whitespace, and store them in a list.
    header_list = []
    number_header_lines = 0
    with open(filename) as f:
        line = f.readline()
        # The 'line and' check stops the loop at end of file, where
        # readline() returns an empty string and line[0] would fail.
        while line and (line[0] == '#' or line[0] == ',' or line[0].isspace()):
            header_list.append(line)
            number_header_lines += 1
            line = f.readline()

    # Skip the header lines; the next line supplies the column names.
    dat = pd.read_csv(filename, header=number_header_lines)

    return dat, header_list
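
To see how the `header=` argument used in the function behaves, here is a minimal sketch on a small in-memory file. The two header lines and the data values are made up for illustration; a real ARROW file has its own header content.

```python
import io
import pandas as pd

# Made-up miniature spectrum file: two '#' header lines, then CSV data
text = (
    "# Example header line 1\n"
    "# Example header line 2\n"
    "frequency,intensity\n"
    "-800000,2.322\n"
    "-795000,2.363\n"
)

# Collect the leading '#' lines, much as read_ARROW_data() does
header_lines = []
for line in text.splitlines(keepends=True):
    if not line.startswith("#"):
        break
    header_lines.append(line)

# header=2 tells pandas that row 2 (counting from 0) holds the column names;
# the rows before it are skipped
df = pd.read_csv(io.StringIO(text), header=len(header_lines))
print(df.columns.tolist())
print(len(df))
```

The saved `header_lines` list still ends each entry with a newline, which is convenient when writing the lines back out later.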

Next, let's write a function to convert from observed frequency to radial velocity. 

> **Note:** You'll need to complete the code in the next cell to make it work properly. You just need to add code to perform the actual conversion where instructed.

> **NOTE:** If you pass a _Pandas_ `Series` (or _NumPy_ `array`) to the function it will operate on all elements in the input data structure.

In [None]:
# Function to convert frequency to radial velocity. You would normally expect this
# to be placed at the start of the program.

def freq_to_vel(freq, f0=1420.4e6):
    '''Takes a frequency value (or Pandas DataFrame column or Series) and returns
    a velocity value (or new DataFrame column of values). f0 is the rest
    frequency in Hz and defaults to 1420.4 MHz.'''

    # We need a value for 'c', the speed of light. Either define it here or,
    # more neatly, use the astropy 'constants'
    c = 299792458.0  # m/s

    #
    #v = # DO YOUR CALCULATION HERE - probably use km/s for convenience
    #

    return v  # (km/s)
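
The note above about passing a `Series` can be checked with a much simpler function. The sketch below uses a hypothetical `hz_to_mhz()` helper (not part of the activity) and made-up frequencies; the same arithmetic works on a single number and on a whole `Series`.

```python
import pandas as pd

def hz_to_mhz(freq_hz):
    """Hypothetical helper: convert a frequency in Hz to MHz."""
    return freq_hz / 1.0e6

# A single value in, a single value out
single = hz_to_mhz(1420.4e6)

# A Series in, a Series of the same length out (made-up frequencies)
freqs = pd.Series([1420.0e6, 1420.4e6, 1421.0e6])
converted = hz_to_mhz(freqs)

print(single)
print(converted)
```

No loop is needed: the division is applied to every element of the `Series` in one step.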

### Step 1 - Get the data

With our helper functions defined, we're now ready to start writing the actual analysis program.

First we'll use our function to read in the header lines and the main data.

> **Note:** You'll need to complete the code in the next cell to make it work properly.

In [3]:
# Prompt the user for a file name (we'll call it file_name)
# You should know how to do this by now
file_name = #?????????

spectrum_df, header_lines = read_ARROW_data(# What goes in here?)

# Display the first few lines - does it look reasonable?
spectrum_df.head(4)


| Unnamed: 0 | frequency | intensity |
|---|---|---|
| 0 | -800000 | 2.322 |
| 1 | -795000 | 2.363 |
| 2 | -790000 | 2.439 |
| 3 | -785000 | 2.446 |


### Step 1b - Baseline removal

This activity uses the "_off-source_" spectra you should have collected.

You'll need to read these in and find the average of their `intensity` columns (the values in the `frequency` column will be the same as those in the main spectrum files, so you don't need to modify them).

The analysis will require the following steps:

1. Read in the separate background files using _Pandas_. We can use the same function as before, but you can ignore the header lines completely.
2. Compute the average of the `intensity` columns. Take advantage of the fact that, as with a _NumPy_ 1D `array`, you can sum the corresponding elements of several _Pandas_ `Series` (or `DataFrame` columns) by just using the `+` operator. Similarly, you can divide all elements in a column by a number by just using `/`. See Section 5.3 (_NumPy Arrays_), in the "_Python - What you need to know_" resource.
3. Subtract this average off-source `intensity` from the main spectrum's `intensity` column.

To read in the data, you could use our data reading function or just skip the header lines using the _Pandas_ **`read_csv()`** function.

Below is a very rudimentary way of reading several separate files using "_hard coded_" file names. You should be able to make this more flexible by providing a list of file names (which could be generated by reading the names from a text file you supply) and iterating or looping over this list.


In [None]:
# Read in the background spectra, average them and subtract from the spectrum
# Here we use 'hard-coded' file names but you could use a file list, or manually enter them
number_header_lines = 12

bg1 = pd.read_csv('bg1.csv', header=number_header_lines)
bg2 = pd.read_csv('bg2.csv', header=number_header_lines)
bg3 = pd.read_csv('bg3.csv', header=number_header_lines)

# Compute average 'intensity' values
bg_av = (bg1['intensity']+bg2['intensity']+bg3['intensity'])/3
print(type(bg_av))
# Subtract from spectrum 'intensity'
spectrum_df['intensity'] = spectrum_df['intensity']-bg_av.values

spectrum_df.head(4)


If you want to try a more efficient data loading approach, you could implement steps similar to the following:
1. Prepare a text file (with a simple text editor, NOT a word processor) containing a list of background files, one file name per line.
2. After opening the text file, use the **`.read().splitlines()`** approach demonstrated in section 2.2 of the `FileIO` notebook to produce a _Python_ list of these file names.
3. Now produce a _Python_ list of the actual data from each of these files. Here's a code snippet that would perform steps 2 and 3:

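```python
# 'bg_files.txt' is a placeholder for the text file prepared in step 1
with open('bg_files.txt') as f:
    bg_files = f.read().splitlines()  # step 2: one file name per line

li = []
for bg_file in bg_files:  # step 3: one DataFrame per background file
    df = pd.read_csv(bg_file, header=12)
    li.append(df)
```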

4. Now that you have a list containing the data, you can produce an average data set by using the **`sum()`** function and then dividing by the number of files, which is the length of the list you've produced.

```python
bg_av = sum(li)/len(li)
```
5. Finally, you can subtract the `intensity` values of this average from the `spectrum_df` `intensity` data as explained above.

```python
spectrum_df['intensity'] = spectrum_df['intensity']-bg_av['intensity'].values
```
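
The `.values` in the subtraction matters because _Pandas_ normally matches rows by index label, not by position. The sketch below uses made-up numbers and deliberately mismatched index labels to show the difference; when both objects have identical indexes, the two forms give the same result.

```python
import pandas as pd

# Made-up intensities; the index labels are deliberately different
spectrum = pd.Series([5.0, 6.0, 7.0], index=[0, 1, 2])
background = pd.Series([1.0, 1.0, 1.0], index=[10, 11, 12])

# Series - Series aligns on index labels: no labels match, so every result is NaN
by_label = spectrum - background

# Series - NumPy array ignores labels and subtracts element by element
by_position = spectrum - background.values

print(by_label.isna().all())
print(by_position.tolist())
```

Using `.values` is therefore a way of saying "subtract row by row", which is only sensible if the two files really do share the same frequency values in the same order.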

### Step 2 - Process the data

Thankfully we've defined our **`freq_to_vel()`** function, so this step is now pretty trivial.

After processing the data, there's another step you'll need to perform to add the newly computed velocity column to the existing `DataFrame`. This modification isn't necessary if you're going to be using straightforward _Python_ File IO to write the data, but doing so makes writing the file later using _Pandas_ pretty easy.

In [None]:
# Convert frequency to radial velocity values using this function
spectrum_v = freq_to_vel(spectrum_df['frequency'])

# Add a new 'velocity' column with these values
spectrum_df['velocity'] = spectrum_v

### Step 3 - Display the processed data

You can use _Matplotlib_ or _Bokeh_. If you need help, consult the `UsingMatplotlib` and `UsingBokeh` notebooks for more information.

Don't forget you need to extract the _x_ values and _y_ values from your data to pass to _Matplotlib_ or _Bokeh_. Section 2 of the `UsingBokeh` notebook should be helpful here.
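
As a rough illustration of extracting the _x_ and _y_ values, here is a minimal _Matplotlib_ sketch. The velocity and intensity numbers are made up, and the axis labels and units are assumptions; substitute the columns of your own `spectrum_df`. With the usual inline backend in a Jupyter notebook, the figure is displayed at the end of the cell.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the processed spectrum
spectrum_df = pd.DataFrame({
    "velocity": [-100.0, -50.0, 0.0, 50.0, 100.0],
    "intensity": [0.1, 0.4, 1.2, 0.5, 0.1],
})

# Extract the x and y values as separate Series
x = spectrum_df["velocity"]
y = spectrum_df["intensity"]

# Draw a simple line plot with labelled axes
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("Radial velocity (km/s)")
ax.set_ylabel("Intensity")
ax.set_title("Example spectrum")
```

The same `x` and `y` objects can be passed to a _Bokeh_ line plot instead; see the `UsingBokeh` notebook for the details.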

In [None]:
# Display code goes here

### Step 4 - Finally, write the modified data out to a file.

We'll leave most of this section for you to complete. Your code should perform the following operations.

1. Prompt the user for a new file name.
2. Use the **`.writelines()`** method described in section 2.1 of the `FileIO` notebook to write the saved header lines to this file.
3. Now **append** the modified _Pandas_ data using the _Pandas_ **`.to_csv()`** method that is illustrated in section 3 of the `UsingPandas` notebook. Don't forget to **append** the data or you will overwrite the header lines.
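
The write-then-append pattern in the steps above can be sketched on made-up data as follows. The header lines, the data values and the temporary output path are all placeholders, and `index=False` is an assumption made here to keep the row index out of the file; adapt these to your own data.

```python
import os
import tempfile
import pandas as pd

# Made-up stand-ins for the saved header lines and the processed DataFrame
header_lines = ["# Example header line 1\n", "# Example header line 2\n"]
spectrum_df = pd.DataFrame({
    "frequency": [-800000, -795000],
    "intensity": [2.322, 2.363],
})

# Placeholder output path; in the activity this would come from the user
new_file_name = os.path.join(tempfile.mkdtemp(), "processed_spectrum.csv")

# 1. Write the header lines, creating (or overwriting) the file
with open(new_file_name, "w") as f:
    f.writelines(header_lines)

# 2. Append the DataFrame; mode='a' adds to the end instead of overwriting
spectrum_df.to_csv(new_file_name, mode="a", index=False)

# Read the file back to check the layout
with open(new_file_name) as f:
    written = f.read().splitlines()
print(written)
```

If `mode="a"` were left out, **`.to_csv()`** would open the file for writing and the header lines written in the first step would be lost.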

> **Note:** You'll need to complete the code in the next cell to make it work properly.

In [None]:
# Prompt for a new file name
new_file_name = #########?

# First write the header lines that we read in earlier to the file.
# Use the .writelines() function from FileIO section 2.1

    
# Now APPEND the modified csv data using the pandas .to_csv() method 
# UsingPandas section 3 should help