## 13 July 2018
-- Laurin Gray

This is a notebook to hold all of the functions I've written during this summer so that I don't have to look through all the notebooks to find a specific one.

These functions are used to access, manipulate, and plot data from the catalog of Spitzer sources of Khan et al. (2015), matched with sources from Whitelock et al. (2013) in CasJobs, in the process of trying to identify potential red candidates for AGBs/YSOs.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.stats import gaussian_kde
import csv
import pathlib

In [2]:
# Read in my data from a .csv file saved locally.

# all sources
phot_data = pd.read_csv('~/Documents/Phot_data/CMDparameters26June2018_lauringray.csv')

filter_phot_data = phot_data[(phot_data < 500.0) & (phot_data > -500.0)]

# 3-sigma red flagged data
flagged_data = pd.read_csv('/Users/lgray/Documents/Phot_data/flagged_vals_10July2018_lauringray.csv')

# CMD counts
CMD_counts = pd.read_csv('/Users/lgray/Documents/Phot_data/CMD_counting_11July2018_lauringray.csv')

### 3-Sigma Line Functions

These functions were written to identify points that are redward of a 3-sigma line on a CMD.  

Examples of usage are in the notebook 6July2018_LG_NGC6822_3SigLineFunct

They use the phot_data (and related filter_phot_data) tables.

In [3]:
def create_bins(bin_size, y1, y2, xval, yval, err):
    """
    Create bins to hold selected x-values, y-values, and errors, along with the coordinate ID, 
    depending on the range of y-values they fall into. Fill those bins with the values, and return the bins.
    
    The user enters the size of the bin they'd like, the range of data to cover, the x and y axes of the CMD,
    and the error associated with the x-axis (created above).  
    
    Note that it is possible to create bins that will have no values in them; these will simply hold a nan spot, 
    and not create a 3-sigma boundary for that range.
    
    The bin size and range must be chosen so that the number of bins comes out as a whole number.  
    If this is not the case, an error message will print.  
    
    y1 must be lower than y2.
    
    xval, yval, and err take the form "Kmag" or "e_kMINUSthreesix" 
    (created from renamed filter_phot_data lists i.e. Kmag = filter_phot_data.Kmag.values)
    
    Example of calling function:
        create_bins(0.5, 11.5, 18.5, kMINUSthreesix, Kmag, e_kMINUSthreesix)
    
    """
    
    n_bins = (y2 - y1)/bin_size
    #print(n_bins)

    if n_bins%1 == 0:
        n_bins = int(n_bins)
    else:
        # stop here: range() below cannot take a fractional number of bins
        raise ValueError("Error: n_bins is not a whole number!  Choose a different range or bin size.")
    
    
    y_bins = [[] for x in range(0,n_bins)] # y-values
    x_bins = [[] for x in range(0,n_bins)] # x-values
    e_bins = [[] for x in range(0,n_bins)] # errors
    c_bins = [[] for x in range(0,n_bins)] # IDs

    #print(y_bins)

    c=0 #row counter
    for i in yval:
        k = 0 #bin counter
        lower = y1 # lower edge of the current bin, restarting from y1 for each point
        while k < n_bins: # k only runs over the valid bin indices 0 to n_bins-1
            if lower <= i < lower+bin_size:
                y_bins[k].append(i)
                x_bins[k].append(xval[c])
                e_bins[k].append(err[c])
                c_bins[k].append(phot_data.ID.values[c])
            lower = lower+bin_size
            k = k+1
        c = c+1
        
    return y_bins, x_bins, e_bins, c_bins
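
The binning above can also be written with NumPy masks. This is only a minimal sketch, not the version used in the notebooks: the name `bin_by_magnitude` and the explicit `ids` argument are assumptions made for illustration, and the input values are synthetic.

```python
import numpy as np

def bin_by_magnitude(bin_size, y1, y2, xval, yval, err, ids):
    """Hypothetical vectorised variant of create_bins; IDs are passed in explicitly."""
    n_bins = int(round((y2 - y1) / bin_size))
    xval, yval, err, ids = (np.asarray(a) for a in (xval, yval, err, ids))
    # Index of the bin each point falls in; NaN magnitudes stay NaN and match no bin.
    idx = np.floor((yval - y1) / bin_size)
    y_bins, x_bins, e_bins, c_bins = [], [], [], []
    for k in range(n_bins):
        in_bin = idx == k  # boolean mask of the points in bin k
        y_bins.append(yval[in_bin])
        x_bins.append(xval[in_bin])
        e_bins.append(err[in_bin])
        c_bins.append(ids[in_bin])
    return y_bins, x_bins, e_bins, c_bins

# Synthetic values, not taken from the catalog.
yval = np.array([11.6, 12.2, 12.4, np.nan, 18.4, 20.0])
xval = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
err = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06])
ids = np.array([1, 2, 3, 4, 5, 6])

y_bins, x_bins, e_bins, c_bins = bin_by_magnitude(0.5, 11.5, 18.5, xval, yval, err, ids)
```

Points outside the range y1 to y2 (and NaN magnitudes) end up in no bin, and empty bins come back as empty arrays.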

In [4]:
def vert_mean(mag_lim, xval, yval):
    """
    Determine the vertical line of the data- the average of the vertical branch.  
    To mitigate the effects of other branches, select a mag_lim that excludes where the points diverge.
    
    Note that this function WILL NOT work for data 
    
    Example of calling function:
        vert_mean(15.0, kMINUSthreesix, Kmag)
    """
    
    mean = []

    c=0
    for i in yval:
        if i < mag_lim:
            mean.append(xval[c])
            c = c+1
        else:
            c = c+1

    bound = np.nanmean(mean)
    stdev = np.nanstd(mean)

    left = bound - 3*stdev
    right = bound + 3*stdev

    clip = [] # sigma-clipped array
    for k in mean:
        if left < k < right:
            clip.append(k)

    boundary = np.nanmean(clip) #this is the average value of the points above the magnitude limit
    #print(boundary)
    
    return boundary
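
The same sigma-clipped mean can be sketched with boolean masks instead of loops. The function name `clipped_mean` and the `n_sigma` parameter are assumptions for illustration, and the data are synthetic.

```python
import numpy as np

def clipped_mean(mag_lim, xval, yval, n_sigma=3.0):
    """Hypothetical mask-based variant of vert_mean (one clipping pass)."""
    xval = np.asarray(xval, dtype=float)
    yval = np.asarray(yval, dtype=float)
    bright = xval[yval < mag_lim]       # colours of points brighter than mag_lim
    centre = np.nanmean(bright)
    spread = np.nanstd(bright)
    # Keep points strictly within n_sigma of the first-pass mean; NaNs drop out here.
    keep = np.abs(bright - centre) < n_sigma * spread
    return np.nanmean(bright[keep])

# Synthetic data: 19 points on the vertical branch, one bright outlier, one faint point.
xval = np.array([0.1] * 19 + [10.0, 5.0])
yval = np.array([14.0] * 20 + [16.0])

boundary = clipped_mean(15.0, xval, yval)
```

Here the outlier at 10.0 sits more than 3 sigma from the first-pass mean, so it is clipped, and the faint point is excluded by the magnitude limit.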

In [5]:
def bound_shift(boundary):
    """
    Create the boundaries in each bin that lie 3-sigma redward of the vertical mean.  
    Input is the return value of vert_mean.  The per-bin errors are read from the global e_bin, 
    so the error bins returned by create_bins must be stored under that name first.
    
    If any bins are empty, this function will emit a RuntimeWarning: Mean of empty slice. 
    This is fine, it just holds a nan value in that spot and won't plot a boundary there.
    
    Call example:
        bound_shift(boundary)
    """
    
    threesig = []
    for i in e_bin:
        threesig.append(np.nanmean(i)*3)
    
    # make list of red limit values
    redlim = []

    for i in threesig:
        redlim.append(boundary+i)
    
    #print(redlim)
    return redlim

In [6]:
def data_flag():
    """
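    Evaluate and flag points that are to the right of the 3-sigma boundary.  
    IDs of flagged points are then stored in a list and returned.
    
    The function takes no arguments: it reads the globals x_bin and c_bin (from create_bins) 
    and redlim (from bound_shift), so these must be defined under those names before calling it.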
    
    It is suggested that when you call the function to a variable, you name it in the format kVS_kMINUSthreesix,
    as this will make it easier to tell which datasets belong to which CMDs when they are all in the same file.
    
    Call example:
        kVS_kMINUSthreesix = data_flag()
    """
    
    IDs = [] #empty set to store IDs

    k=0
    #for i in x_bin[:75]:  # use instead of below statement if need to exclude below a certain point
    for i in x_bin:
        coord = c_bin[k]
        c=0
        for x in i:
            if x > redlim[k]:
                IDs.append(coord[c])
                c=c+1
            else: 
                c=c+1
        k = k+1
            
    print("Number of flagged points:", len(IDs))
    #print("IDs of points:", IDs)
    return IDs
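
For reference, the boundary and flagging steps can be written so that the bins are passed in as arguments rather than read from globals. This is a sketch under that assumption; `red_limits` and `flag_red` are hypothetical names and the bins below are synthetic.

```python
import numpy as np

def red_limits(boundary, e_bins):
    """3-sigma red limit for each bin; empty bins get NaN instead of a warning."""
    return [boundary + 3 * np.nanmean(e) if len(e) > 0 else np.nan for e in e_bins]

def flag_red(x_bins, c_bins, redlim):
    """Return the IDs of points redward of their bin's limit."""
    flagged = []
    for xs, ids, lim in zip(x_bins, c_bins, redlim):
        for x, source_id in zip(xs, ids):
            if x > lim:  # always False when lim is NaN, so empty bins flag nothing
                flagged.append(source_id)
    return flagged

# Synthetic bins: the middle one is empty.
e_bins = [[0.1, 0.1], [], [0.2]]
x_bins = [[0.4, 0.9], [], [1.0]]
c_bins = [[10, 11], [], [12]]

redlim = red_limits(0.2, e_bins)
flagged = flag_red(x_bins, c_bins, redlim)
```

Passing the bins explicitly makes it harder to accidentally flag against the bins of a different CMD when several are open in the same notebook.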

In [7]:
def save_data(dataset, column=''):
    """
    Function for saving a single column of data at a time.
    
    Check if the data file already exists.  If it does, add the data on as a new column with a 
    header you set when you call the function.  If it doesn't, create the file and add the data to it.
    
    Call example:
        save_data(jVS_jMINUSthreesix, column='jVS_jMINUSthreesix')
    """
    
    # filename is read from the notebook's global namespace
    if pathlib.Path(filename).exists():
        flagged_points = pd.read_csv(filename)
        new_points = pd.DataFrame({column:dataset})

        flagged_points= pd.concat([flagged_points, new_points], axis=1)
        flagged_points.to_csv(filename, index=False)
    else:
        # newline='' stops the csv module from writing blank rows on Windows
        with open(filename, 'w', newline='') as f:
            writer = csv.writer(f)
            #add heading
            points_w_header = [column] + list(dataset)

            for val in points_w_header:
                writer.writerow([val])
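
Both branches can also go through pandas, which pads shorter columns with NaN when they are concatenated. A minimal sketch, assuming the file name is passed in as an argument; `save_column` is a hypothetical name and the file lives in a temporary directory.

```python
import pathlib
import tempfile
import pandas as pd

def save_column(dataset, column, filename):
    """Hypothetical variant of save_data that takes the file name explicitly."""
    path = pathlib.Path(filename)
    new_points = pd.DataFrame({column: list(dataset)})
    if path.exists():
        existing = pd.read_csv(path)
        # axis=1 adds the new data as an extra column; shorter columns are NaN-padded
        new_points = pd.concat([existing, new_points], axis=1)
    new_points.to_csv(path, index=False)

with tempfile.TemporaryDirectory() as tmp:
    demo_file = pathlib.Path(tmp) / "flagged_demo.csv"
    save_column([1, 2, 3], "kVS_kMINUSthreesix", demo_file)
    save_column([2, 3], "jVS_jMINUSk", demo_file)
    result = pd.read_csv(demo_file)
```

Columns that were padded with NaN are read back as floats, so IDs may need converting back to int after dropping the NaN rows.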

### CMD Counting

This section counts how many CMDs each flagged point appears in.  There is only one function, but it is run once for each CMD you flagged points in.  When running on multiple columns, it may be useful to start with the longest column and work down to the shortest.  This ensures that the most points are sorted on the first run, so the function runs faster each time.

This function was written after the CMDs had been counted, to make it easier in case we have to do it again.  CMD counting took place in the notebook 10July2018_LG_NGC6822_CMD_Crossref

This function uses flagged_data.

In [8]:
def CMD_count(col, column_list):
    """
    Before running the function, user defines empty lists of CMD counts as:
        in_one = []
        in_two = []
        in_three = []
        in_four = []
        in_five = []
        in_six = []
        in_seven = []
        in_eight = []
        in_nine = []
        in_ten = []
    
    This is so that the function can be run on multiple columns without erasing the previous lists.
    
    User chooses a CMD column that they want to use to iterate through the other CMDs (col), and defines a list
    including all other columns (column_list).
    
    For each element in the chosen column, the function goes through the list of columns, 
    and checks if the element is in a column.  Each time it is, +1 is added to a counter.  The element is then 
    sorted into a list based on the final value of the counter, after checking to make sure that 
    the element is not already in that list (so that it can be run on multiple columns).
    
    The function also prints which row of the chosen column the function is on every 100 rows, 
    which is useful for estimating progress.
    """
    
    k = 0 # row counter
    for i in col:
        k = k+1
        if k%100 == 0:
            print("On row", k) # print which row it's on every 100 rows, so I have an idea of the progress
    
        a=1 # CMD counter; starts at 1 because i already appears in col itself
        for c in column_list:
            s = set(c)
            if i in s:
                a = a+1
    
        if a == 1 and int(i) not in in_one:
            in_one.append(int(i))
        elif a == 2 and int(i) not in in_two:
            in_two.append(int(i))
        elif a == 3 and int(i) not in in_three:
            in_three.append(int(i))
        elif a == 4 and int(i) not in in_four:
            in_four.append(int(i))
        elif a == 5 and int(i) not in in_five:
            in_five.append(int(i))
        elif a == 6 and int(i) not in in_six:
            in_six.append(int(i))
        elif a == 7 and int(i) not in in_seven:
            in_seven.append(int(i))
        elif a == 8 and int(i) not in in_eight:
            in_eight.append(int(i))
        elif a == 9 and int(i) not in in_nine:
            in_nine.append(int(i))
        elif a == 10 and int(i) not in in_ten:
            in_ten.append(int(i))
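
The same count can be done in a single pass over all columns with `collections.Counter`. This is a sketch and not the method used in the 10July2018 notebook: `count_cmds` and its dict input are assumptions, and the IDs are synthetic.

```python
import math
from collections import Counter

def count_cmds(columns):
    """columns: dict of CMD name -> flagged IDs (hypothetical input format).

    Returns a dict mapping a CMD count to the list of IDs with that count.
    """
    counts = Counter()
    for ids in columns.values():
        # A set counts each ID once per CMD; NaN padding from the csv is skipped.
        counts.update({int(i) for i in ids if not math.isnan(i)})
    tiers = {}
    for source_id, n in counts.items():
        tiers.setdefault(n, []).append(source_id)
    return tiers

# Synthetic columns of unequal length.
columns = {
    "kVS_kMINUSthreesix": [1, 2, 3],
    "jVS_jMINUSk": [2, 3, float("nan")],
    "hVS_hMINUSk": [3],
}
tiers = count_cmds(columns)
```

Here `tiers[3]` plays the role of an `in_three` list, and every column is visited exactly once, so the order in which columns are processed no longer matters.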

### Plotting Red Candidates

These are functions for plotting where red candidates that appear in a minimum number of CMDs fall on the CMDs themselves.  

Examples of usage are in the notebook 12July2018_LG_NGC6822_RedCandPlot_Layers

They use phot_data (& filter_phot_data) and CMD_counts.

In [9]:
def corr_rows(groups, lengths):
    """
    Some of the ID numbers are wrong (ex. there are two 2118s), which means we can't use the ID to 
    directly access the row it belongs to. As we go further down, the problem gets worse.
    This function finds the correct rows in phot_data for each ID and saves them to a list.

    Takes a list of the CMD counts you want to include in the plot (group) (and their corresponding lengths) 
    and outputs a list of the rows in phot_data which correspond to the IDs in those groups.
    
    Note that because groups must be a list, even if you are just running one column, 
    you need to define the column and length in a list first (ex. groups = [in_ten]; corr_rows(groups, lengths)).
    
    Call example:
        group = [in_ten, in_nine, in_eight, in_seven, in_six, in_five]
        length = [152, 107, 177, 65, 101, 113]

        source_rows = corr_rows(group, length)
    """
    
    rows = phot_data.ID.values
    
    phot_rows = []
    
    #k = 0
    d = 0
    for j in groups:
        group_lim = lengths[d]
        k = 0
        #print(group_lim)
        for i in j:
            c = 0 # counter for phot_data rows, resets for each new element i
            # use a while loop so that it iterates until the end of the column
            while c < 30761 and k < group_lim: # to prevent reaching the end of the column and getting a nan error
                if int(i) != rows[c]: # check if i is equivalent to the ID in the current phot_data row
                    c = c+1 # if not, move to next row & go back to the top of the while loop
                else: # if i IS equivalent
                    phot_rows.append(c) # add the current row to phot_rows
                    c = 30761 # set c to stop iterating through the rest of the rows (end loop)
            k = k+1 # symbolically move onto the next element in in_ten (to stop the while loop at the end of in_ten)
        
        d = d+1
    
    return phot_rows
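
The row search can also be done with a dictionary built once from the ID column, which avoids the hard-coded table length. A minimal sketch, assuming the ID column contains no NaN values; `first_rows` is a hypothetical name and the IDs are synthetic. Like the while loop above, it keeps the first row that matches a duplicated ID and skips IDs that are not found, but it does not take a lengths list.

```python
import numpy as np

def first_rows(groups, id_column):
    """Map each ID in groups to the first row where it appears in id_column."""
    first_row = {}
    for row, source_id in enumerate(id_column):
        # setdefault keeps the earliest row when an ID is duplicated
        first_row.setdefault(int(source_id), row)
    rows = []
    for group in groups:
        for source_id in group:
            row = first_row.get(int(source_id))
            if row is not None:  # IDs missing from the table are skipped
                rows.append(row)
    return rows

# Synthetic ID column with a duplicated ID (7) and a lookup for a missing ID (42).
id_column = np.array([5, 7, 7, 9])
demo_rows = first_rows([[9, 7], [5, 42]], id_column)
```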

In [10]:
def xy_lookup(xaxis, yaxis, source_rows):
    """
    Takes the corrected row of a source in the CMD count, then uses it to look up 
    the x and y values in the phot_data table.
    source_rows should come from the output of the corr_rows function.
    
    This function is called "coord_lookup" in the 11July2018 notebook.
    
    Call example:
        x_flag, y_flag = xy_lookup(threesixMINUSeightzero, eightzero, source_rows)
    """
    
    x_vals = []
    y_vals = []
    k = 0 # row counter
    for i in source_rows:
        x_vals.append(xaxis[i])
        y_vals.append(yaxis[i])
    
    return x_vals, y_vals

In [11]:
def plot_red_layers(x_flags, y_flags):
    """
    User creates two lists containing all of the flagged x and y points that were separately saved. 
    The function then iterates through that list and scatterplots each set in a different color.
    
    This function also requires user to create a col_names list containing the names of each column included in the 
    final plot in string form.
    
    The user can choose whether to assign colors evenly based on the number of columns plotted, 
    or to keep the same colors with each column no matter how many there are.  
    To assign evenly, uncomment number = len(x_flags) and colors = [cmap(i) for i in np.linspace(0, 1, number)]
    """
    
    #number = len(x_flags)  # for even color assignment
    cmap = plt.get_cmap('gist_rainbow')
    colors = [cmap(i) for i in np.linspace(0, 1, 6)] # for consistent color assignment
    #colors = [cmap(i) for i in np.linspace(0, 1, number)]  # for even color assignment
    
    k = 0
    for i in x_flags:
        plt.scatter(x_flags[k], y_flags[k], c=colors[k], label=col_names[k], s=1)
        k = k+1
    plt.legend(loc='best')

In [12]:
def plot_CMD(xaxis, yaxis, x_flagged, y_flagged):
    """
    Plot a CMD, and overplot the values that are flagged in whichever CMD counts were listed in groups (for corr_rows).
    
    This function DOES have layering capabilities, simply flag points for each individual group separately
    and then create lists combining the flagged x & y points.
    
    Note that axes limits/labels are pre-defined for the 10 CMDs that were checked.
    """
    
    # set axis limits & names (so I don't have to do it manually each time I plot)
    if yaxis is eightzero:
        y1 = 8.0
        y2 = 18.0
        ylabel = '[8.0]'
    elif yaxis is Hmag:
        y1 = 11.5
        y2 = 19.0
        ylabel = 'H'
    elif yaxis is Kmag:
        y1 = 11.5
        y2 = 19.0
        ylabel = 'K'
    
    if xaxis is threesixMINUSeightzero:
        x1 = -2.0
        x2 = 7.5
        xlabel = '[3.6] - [8.0]'
    elif xaxis is fourfiveMINUSeightzero:
        x1 = -1.5
        x2 = 6.5
        xlabel = '[4.5] - [8.0]'
    elif xaxis is jMINUSh:
        x1 = -0.5
        x2 = 2.5
        xlabel = 'J - H'
    elif xaxis is hMINUSk:
        x1 = -0.5
        x2 = 2.0
        xlabel = 'H - K'
    elif xaxis is hMINUSthreesix:
        x1 = -1.5
        x2 = 4.0
        xlabel = 'H - [3.6]'
    elif xaxis is hMINUSfourfive:
        x1 = -2.5
        x2 = 5.0
        xlabel = 'H - [4.5]'
    elif xaxis is jMINUSk:
        x1 = -0.5
        x2 = 4.0
        xlabel = 'J - K'
    elif xaxis is kMINUSthreesix:
        x1 = -2.0
        x2 = 3.5
        xlabel = 'K - [3.6]'
    elif xaxis is kMINUSfourfive:
        x1 = -2.5
        x2 = 3.5
        xlabel = 'K - [4.5]'
    
    plt.figure(figsize=(10,10))
    plt.plot(xaxis,yaxis,',', color='grey')
    plt.xlim(x1, x2)
    plt.ylim(y2, y1)
    plt.xlabel(xlabel, size=12)
    plt.ylabel(ylabel, size=12)
    
    plot_red_layers(x_flagged, y_flagged)

### Tiered Catalog

These are functions for creating the tiered catalog of flagged sources, organized by CMD count.

Examples of usage are in the notebook 13July2018_LG_NGC6822_TieredCat

They use phot_data and CMD_counts.

In [13]:
def corr_rows(groups, lengths):
    """
    Some of the ID numbers are wrong (ex. there are two 2118s), which means we can't use the ID to 
    directly access the row it belongs to. As we go further down, the problem gets worse.
    This function finds the correct rows in phot_data for each ID and saves them to a list.

    Takes a list of the CMD counts you want to include in the plot (group) (and their corresponding lengths) 
    and outputs a list of the rows in phot_data which correspond to the IDs in those groups.
    
    Note that because groups must be a list, even if you are just running one column, 
    you need to define the column and length in a list first (ex. groups = [in_ten]; corr_rows(groups, lengths)).
    
    For the tiered catalogue, list the columns in order of most to least confidence.
    
    Call example:
        group = [in_ten, in_nine, in_eight, in_seven, in_six, in_five]
        length = [152, 107, 177, 65, 101, 113]

        source_rows = corr_rows(group, length)
    """
    
    rows = phot_data.ID.values
    
    phot_rows = []
    
    #k = 0
    d = 0
    for j in groups:
        group_lim = lengths[d]
        k = 0
        #print(group_lim)
        for i in j:
            c = 0 # counter for phot_data rows, resets for each new element i
            # use a while loop so that it iterates until the end of the column
            while c < 30761 and k < group_lim: # to prevent reaching the end of the column and getting a nan error
                if int(i) != rows[c]: # check if i is equivalent to the ID in the current phot_data row
                    c = c+1 # if not, move to next row & go back to the top of the while loop
                else: # if i IS equivalent
                    phot_rows.append(c) # add the current row to phot_rows
                    c = 30761 # set c to stop iterating through the rest of the rows (end loop)
            k = k+1 # symbolically move onto the next element in in_ten (to stop the while loop at the end of in_ten)
        
        d = d+1
    
    return phot_rows

In [14]:
def data_lookup(source_rows):
    """
    Takes the row of a source in the CMD count, then uses it to look up the related RA, Dec, & magnitudes
    in the phot_data table.  source_rows should come from the output of the corr_rows function.
    
    It is very similar to xy_lookup, but accesses all of the data associated with that ID instead of 
    just the desired x and y axes.
    
    This function is called "coord_lookup" in the 13July2018 TieredCat notebook.
    
    Call example:
        (ID, RA, Dec, k36mag, k45mag, k58mag, k80mag, k24mag, Jmag, Hmag, Kmag, 
            jMINUSh, hMINUSk, jMINUSk) = data_lookup(source_rows)
    """
    
    ID = []
    RA = []
    Dec = []
    k36mag = []
    k45mag = []
    k58mag = []
    k80mag = []
    k24mag = []
    Jmag = []
    Hmag = []
    Kmag = []
    jMINUSh = []
    hMINUSk = []
    jMINUSk = []
    
    k = 0 # row counter
    for i in source_rows:
        ID.append(phot_data.ID.values[i])
        RA.append(phot_data.RA.values[i])
        Dec.append(phot_data.Dec.values[i])
        k36mag.append(phot_data.k36mag.values[i])
        k45mag.append(phot_data.k45mag.values[i])
        k58mag.append(phot_data.k58mag.values[i])
        k80mag.append(phot_data.k80mag.values[i])
        k24mag.append(phot_data.k24mag.values[i])
        Jmag.append(phot_data.Jmag.values[i])
        Hmag.append(phot_data.Hmag.values[i])
        Kmag.append(phot_data.Kmag.values[i])
        jMINUSh.append(phot_data.jMINUSh.values[i])
        hMINUSk.append(phot_data.hMINUSk.values[i])
        jMINUSk.append(phot_data.jMINUSk.values[i])
    
    return ID, RA, Dec, k36mag, k45mag, k58mag, k80mag, k24mag, Jmag, Hmag, Kmag, jMINUSh, hMINUSk, jMINUSk
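
Since source_rows holds positional row numbers, the same lookup can be sketched with `DataFrame.iloc`, which returns all columns at once. The small table below is a synthetic stand-in for phot_data with made-up values and only a few of its columns.

```python
import pandas as pd

# Synthetic stand-in for phot_data (values are made up for illustration).
demo = pd.DataFrame({
    "ID": [5, 7, 7, 9],
    "RA": [296.1, 296.2, 296.3, 296.4],
    "Dec": [-14.7, -14.8, -14.9, -15.0],
    "Kmag": [15.2, 16.1, 16.3, 14.8],
})

source_rows = [3, 1, 0]  # e.g. the output of corr_rows

# iloc selects by position and keeps the order given in source_rows,
# so the tiers stay in the order they were listed in.
catalog = demo.iloc[source_rows].reset_index(drop=True)
```

The resulting table could then be written out in one step with `catalog.to_csv(filename, index=False)`.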

In [15]:
def save_cat(filename):
    """
    Saves the tiered catalogue of red candidates to a csv file.  
    Note that the file is opened in write mode, so an existing file with the same name is overwritten.
    The columns are read from the global lists returned by data_lookup (ID, RA, Dec, etc.).
    
    Call example:
        filename = '/Users/lgray/Documents/Phot_data/RedCandTiers_13July2018_lauringray.csv'
        save_cat(filename)
    """
    
    f = open(filename, 'w')
    writer = csv.writer(f)
    #add heading
    points_w_header = ['ID'] + ID

    for val in points_w_header:
        writer.writerow([val])

    f.close()

    # list of other columns
    cols = [RA, Dec, k36mag, k45mag, k58mag, k80mag, k24mag, Jmag, Hmag, Kmag, jMINUSh, hMINUSk, jMINUSk]
    headers = ['RA', 'Dec', 'k36mag', 'k45mag', 'k58mag', 'k80mag', 'k24mag', 'Jmag', 'Hmag',
           'Kmag', 'jMINUSh', 'hMINUSk', 'jMINUSk']

    c=0
    for i in cols:
        data = pd.read_csv(filename)
        new_col = pd.DataFrame({headers[c]:i})
        c = c+1

        data= pd.concat([data, new_col], axis=1)
        data.to_csv(filename, index=False)

Because all the data from data_lookup goes into one file, and the csv format won't save any grouping done in Excel, individual tiers are accessed by indexing the rows that contain them.

The indexes are as follows:

- in_ten: 0 to 151
- in_nine: 152 to 258
- in_eight: 259 to 435
- in_seven: 436 to 500
- in_six: 501 to 601
- in_five: 602 to 714
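
These ranges follow directly from the tier lengths used in the corr_rows call example, and can be recomputed with a cumulative sum if the lengths change. A small sketch (the variable names are only for illustration):

```python
import numpy as np

tier_names = ["in_ten", "in_nine", "in_eight", "in_seven", "in_six", "in_five"]
lengths = [152, 107, 177, 65, 101, 113]

stops = np.cumsum(lengths)   # one past the last row of each tier
starts = stops - lengths     # first row of each tier

# Inclusive (first, last) row index for every tier.
tier_rows = {name: (int(a), int(b) - 1)
             for name, a, b in zip(tier_names, starts, stops)}
```

With a catalogue loaded into a DataFrame, a tier can then be selected by position, for example with `.iloc[first:last + 1]`.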