# Seminar 3: File IO, Pandas and Plotting

In Seminar 2 we covered the basics of conditional statements and how to use them in your code. We also covered how to write your own functions, the difference between local and global variables, and how to jot down in plain English the steps needed to reach an objective so that you can translate that plan into code later.

Quick review of Seminar 2 concepts:

If-Else Statements
    
    if condition1:
        Execute Code for Condition 1 Here
    elif condition2:
        Execute Code for Condition 2 Here
    else:
        Execute Code if neither condition is met
        
For-Loops:
    
    #this code iterates over the values of iterable 
    for variable_name in iterable:
        CODE FOR FOR-LOOP
    
    #this for-loop loops over the indices of the array iterable
    for idx in range(len(iterable)):
        CODE FOR FOR-LOOP
    
Custom Functions
    
    def FUNCTION_NAME(ARGUMENTS):  # arguments are optional
        '''
        DOCSTRING GOES HERE
        '''
        
        CODE FOR FUNCTION
        
        return RETURN_VARIABLE (if needed)
        
Flowcharts and Pseudocode

    What is the Code Objective? (ie: What do I want my code to do?)
    What information do I have available to me? (ie: some files, maybe an equation(s))
    What are some coding tools I have to achieve the Code Objective? (ie: if-else, File IO, plotting, custom functions)
    
    Answering the above questions will be crucial for your flowchart and pseudocode as 
    this will help guide you into what needs to be done and then allow you to translate the written 
    statements into coding syntax
    
   
Now we pivot to a very important part of coding for research: reading in files. In research we constantly work with files and send files to our advisors and collaborators, so knowing how to open a file and grab the relevant data is extremely important for your success as a researcher. In this notebook we cover how to open basic text files and csv files; in Seminar 4 we will cover how to open and access FITS files.

In [None]:
import numpy as np
import pandas as pd

# Reading in Text Files

Text files are one of the most common ways to send and store information. They are easy to open and make, and luckily for us there are plenty of functions that we can use to access their contents.

Python has a built-in function that can open all kinds of files. It is very versatile and, when used well, can handle even complicated file formats. This function is the $\textbf{open}$ function. We will cover its basic use for reading existing files and for generating new ones. The syntax for reading in a file using open is as follows:

    file = open('PATH/TO/FILE/FILENAME.txt', 'r')
    

The 'r' is important here as it tells Python that we only want to read the file and do not want to write anything to it. The variable file can therefore only be used for reading.

The variable file is now a file object (a TextIOWrapper) that gives access to the contents of the file. We can isolate each row of the file by looping over file with a $\textbf{For-Loop}$ or by using one of the access functions that file has, such as readlines.
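
As a minimal sketch of the two approaches just described, the cell below first writes a small stand-in file so that it runs anywhere, then reads it back once with a for-loop and once with readlines. The file name demo_coordinates.txt and its contents are made up for illustration and are not one of the seminar files.

```python
import os
import tempfile

# hypothetical stand-in file (not one of the seminar files)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_coordinates.txt')

demo_file = open(demo_path, 'w')
demo_file.write('ID RA DEC\n')
demo_file.write('1 150.1 2.2\n')
demo_file.write('2 150.3 2.5\n')
demo_file.close()

# approach 1: loop over the file object, one row per iteration
demo_file = open(demo_path, 'r')
rows_from_loop = []
for row in demo_file:
    rows_from_loop.append(row)
demo_file.close()

# approach 2: grab every row at once with readlines
demo_file = open(demo_path, 'r')
rows_from_readlines = demo_file.readlines()
demo_file.close()

print(rows_from_loop)
print(rows_from_loop == rows_from_readlines)
```

Both approaches give the same list of strings, and every row still ends with the new line character \n.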

# Example 1: Reading in the file Galaxy_Coordinates.txt 

In [None]:
#code to read in the file
file = open('Galaxy_Coordinates.txt', 'r')

In [None]:
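#let's see what functions we have available to us
#in Jupyter you can also type "file." and press Tab to list them
[name for name in dir(file) if not name.startswith('_')]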

In [None]:
#code that grabs all the lines in the file and stores it to the variable lines
lines = file.readlines()

In [None]:
lines

In [None]:
#it is always good practice to close the file when you are done so that it does not tie up resources in the jupyter notebook
file.close()

In [None]:
#code that loops through all the lines in the file line-by-line
for row in lines:
    print(f'Row: {row} Type: {type(row)} --- Length of the Row: {len(row)}')
    print()

# A quick guide on string splitting

In [None]:
lines

In [None]:
row1 = lines[0]
row1.split()

In [None]:
test_filename = 'Unique_Filename_Number'
test_filename.split('_')

In [None]:
test_path = 'Usr/Desktop/Python_BootCamp/Seminar3/File.txt'
test_path.split('/')

In [None]:
#Making empty arrays to later store data from the file here
col1 = np.array([])
col2 = np.array([])
col3 = np.array([])
col4 = np.array([])

#looping over all the rows except the first as the first row contains the column names
for row in lines[1:]:
    

    col1 = np.append(col1, row.split()[0])
    col2 = np.append(col2, row.split()[1])
    col3 = np.append(col3, row.split()[2])
    col4 = np.append(col4, row.split()[3])
        
col1 = col1.astype(int)
col2 = col2.astype(int)
col3 = col3.astype(float)
col4 = col4.astype(float)

In [None]:
print(f'Column 1: {col1}, Type: {type(col1[0])}')
print(f'Column 2: {col2}, Type: {type(col2[0])}')
print(f'Column 3: {col3}, Type: {type(col3[0])}')
print(f'Column 4: {col4}, Type: {type(col4[0])}')

# Writing to a File

Often when we are doing research we want to save the output of our analysis to a file for later use. That way we do not need to redo the analysis and can instead open the file that holds the subset of data we want. We can also use the $\textbf{open}$ function to write to a file and save the results. The only change to the syntax is the mode: instead of 'r', which we used when reading a file, we use 'w+', which lets us write to the file and creates it if it does not exist.

## String Formatting to Write to Files

One way to write data to a file is to leverage a Python tool called $\textit{string formatting}$. String formatting makes the string output look clean because you can line everything up how you want. It can also keep a fixed number of digits after the decimal point and can even write data in exponential form. To use this formatting on the variable *value*, see below.

Basic Syntax:

    f'STRING TO WRITE {value}'

Advanced Syntax:

    f'STRING TO WRITE {value:.2f}' Formats the number as a float with 2 digits after the decimal
    f'STRING TO WRITE {value:.3f}' Formats the number as a float with 3 digits after the decimal
    f'STRING TO WRITE {value:.2e}' Formats the number in exponential form with 2 digits after the decimal (ie: 1.23e-10)
    f'STRING TO WRITE {value:.3e}' Formats the number in exponential form with 3 digits after the decimal (ie: 1.233e-10)
    
    The number before the decimal sets the minimum number of characters the output occupies; longer values are not cut off
    f'STRING TO WRITE {value:5.2f}' Formats the number as a float with 2 digits after the decimal, at least 5 characters wide
    f'STRING TO WRITE {value:5.3e}' Formats the number in exponential form with 3 digits after the decimal, at least 5 characters wide
    
You can even format several variables in the same string by adding a {} for every variable that you want to format.
    
    f'{data1:.2f} {data2:.3f} {data3:.2e}'
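
The format codes above can be tried out on a single number. The sketch below uses a made-up value and wraps each result in repr so that the padding spaces are visible.

```python
value = 1234.56789

fixed_2 = f'{value:.2f}'      # 2 digits after the decimal
expo_3 = f'{value:.3e}'       # exponential form, 3 digits after the decimal
wide = f'{value:10.2f}'       # padded with spaces to 10 characters
too_narrow = f'{value:5.2f}'  # needs 7 characters, so the width of 5 has no effect

print(repr(fixed_2))     # '1234.57'
print(repr(expo_3))      # '1.235e+03'
print(repr(wide))        # '   1234.57'
print(repr(too_narrow))  # '1234.57'
```

Note that the width is a minimum: a number that needs more characters is written out in full rather than cut off.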
    
More info on f-string formatting can be found here: 

https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/#

https://realpython.com/python-f-strings/ 


In [None]:
value = 0.0355

In [None]:
f'Value is: {value:.2f}'

In [None]:
f'Value is: {value:.2e}'

In [None]:
#Generating random data to write to a file
np.random.seed(100)

wavelength = np.linspace(1000, 4500, 1000)
flux = np.random.normal(loc = 3, scale = 5, size = 1000)
flux_err = np.random.normal(loc = 1, scale = 1, size = 1000)

In [None]:
#Code that makes a new file and generates the file handler variable spectra_file

#opening up the file and using the file handler variable spectra_file
spectra_file = open('Output_Spectra.txt', 'w+')

#writing to a file using the file handler and the .write() command. NOTE the \n (new line character) at the end

#writing out the column names of the file first
spectra_file.write(f'Wavelength Flux Flux_err \n')

#actual data
for wave, Fnu, Fnu_err in zip(wavelength, flux, flux_err):
    
    spectra_file.write(f'{wave:.3f} {Fnu:.2f} {Fnu_err:.2f}\n')
    
spectra_file.close()

# Caution: Overwriting Files

Be extremely careful when you write data to a file, especially when you work with similar file names: if you write to a file that already exists, Python $\textbf{will overwrite}$ that file. There is no automatic warning about a duplicate file name, so double check the file name before you write.
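
One way to guard against accidental overwrites, sketched here as an optional extra rather than part of the seminar code, is the 'x' mode of open. It creates a new file and raises a FileExistsError if the file already exists. The file name demo_output.txt is made up for illustration.

```python
import os
import tempfile

# hypothetical output file in a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'demo_output.txt')

# first write: the file does not exist yet, so 'x' creates it
out_file = open(out_path, 'x')
out_file.write('first version\n')
out_file.close()

# second write: the file now exists, so 'x' refuses to overwrite it
was_blocked = False
try:
    out_file = open(out_path, 'x')
    out_file.write('second version\n')
    out_file.close()
except FileExistsError:
    was_blocked = True
    print(f'{out_path} already exists, nothing was overwritten')

# the original contents are still there
check_file = open(out_path, 'r')
contents = check_file.read()
check_file.close()
print(contents)
```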

# Using Numpy functions to read and write files

The open function is very useful, but getting the data out of it can be confusing, especially when the file has many data types. In the following code cells we cover two numpy functions for opening text files along with some of their pros and cons.

## np.loadtxt

The first function we will cover is np.loadtxt, which is very convenient when the file is all numerical data. This is because by default np.loadtxt tries to load the file as $\textbf{floats}$. You can run into problems when the file has column names or a column made up of strings, but we will cover some workarounds for this.

Basic Syntax to use np.loadtxt

    data = np.loadtxt('PATH/TO/FILE/FILENAME.txt')
    
    
By default np.loadtxt will try to read in the file assuming it is all numeric data and convert it all to floats.

In [None]:
#try running the cell below and see what you get
data_loadtxt = np.loadtxt('Galaxy_Coordinates.txt')

In [None]:
#You may have gotten an error, open up the file and see what could be causing the issue

In [None]:
data_loadtxt = np.loadtxt('Galaxy_Coordinates.txt', skiprows=1)
#skiprows=1 skips the first row, which holds the column names

In [None]:
ID1, ID2, RA, DEC =  np.loadtxt('Galaxy_Coordinates.txt', skiprows=1, unpack = True)
#unpack = True returns every column as its own array
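
The np.loadtxt behaviour covered in this section can be reproduced in a small self-contained sketch. It uses io.StringIO to stand in for a file on disk, and the three-column table is made up for illustration.

```python
import io
import numpy as np

# made-up table with one row of column names followed by numeric data
demo_text = 'ID RA DEC\n1 150.1 2.2\n2 150.3 2.5\n'

# without skiprows, the column names cannot be converted to floats
try:
    np.loadtxt(io.StringIO(demo_text))
    header_failed = False
except ValueError as error:
    header_failed = True
    print(f'loadtxt failed: {error}')

# skipping the first row leaves only numeric data
demo_data = np.loadtxt(io.StringIO(demo_text), skiprows=1)
print(demo_data.dtype, demo_data.shape)

# unpack = True hands back one array per column
demo_id, demo_ra, demo_dec = np.loadtxt(io.StringIO(demo_text), skiprows=1, unpack=True)
print(demo_ra)
```

Even the ID column comes back as floats here, since np.loadtxt converts everything to float unless told otherwise.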

## np.genfromtxt

The next function we will cover is np.genfromtxt, a more general version of np.loadtxt. It can read numerical values (ints and floats) and can also read strings and booleans. Let us see it in action.

Basic Syntax to use np.genfromtxt

    data = np.genfromtxt('PATH/TO/FILE/FILENAME.txt')

In [None]:
data_genfromtxt = np.genfromtxt('Galaxy_Coordinates.txt', skip_header=1)

In [None]:
ID1, ID2, RA, DEC = np.genfromtxt('Galaxy_Coordinates.txt', skip_header=1, unpack = True)
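
To illustrate how np.genfromtxt copes with mixed data types, here is a hedged sketch on a made-up table held in io.StringIO. The names = True option, which takes the column names from the first row, is an extra that the seminar cells do not use.

```python
import io
import numpy as np

# made-up table that mixes integers, floats and strings
demo_text = 'ID RA DEC Field\n1 150.1 2.2 COSMOS\n2 34.4 -5.2 UDS\n'

# dtype = None lets genfromtxt pick a type for each column
# names = True takes the column names from the first row
demo_table = np.genfromtxt(io.StringIO(demo_text),
                           dtype=None,
                           names=True,
                           encoding='utf-8')

print(demo_table.dtype.names)
print(demo_table['RA'])
print(demo_table['Field'])
```

The result is a structured array, so each column can be pulled out by its name and keeps its own data type.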

# Opening a more complex file

In [None]:
just_data = np.loadtxt('Field_Coordinates.txt', 
                       skiprows=1, 
                       usecols=(0, 1, 2, 3))

In [None]:
just_field = np.loadtxt('Field_Coordinates.txt', 
                        skiprows=1, 
                        dtype = str, 
                        usecols=(4))

In [None]:
full_data = np.loadtxt('Field_Coordinates.txt', skiprows=1, dtype = str)

In [None]:
full_data

In [None]:
X = np.genfromtxt('Field_Coordinates.txt', 
                  skip_header= 1, 
                  unpack = True, 
                  dtype = None, 
                  encoding = 'utf-8')

In [None]:
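#printing out the first entry of X; with unpack = True this is the first column on recent NumPy versions (older versions return the first row instead)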
X[0]

In [None]:
print(f'{type(X[0][0])} {type(X[1][0])} {type(X[2][0])} {type(X[3][0])} {type(X[4][0])}')

# Intro to Pandas and Dataframes

Pandas is a package that lets you manipulate data in table form. You can easily perform computations on these tables, quickly pull out a subset of the data given a condition, and access the data with ease. DataFrames work very similarly to Astropy tables, so if you are comfortable with pandas and DataFrames, Astropy tables should be easy to pick up as well. We will cover the different ways of defining a DataFrame and how to do DataFrame manipulations such as filtering and merging. Lastly, we will cover some of the pandas file IO so that you can use pandas to open and write all sorts of files.

# Pandas DataFrames

Pandas DataFrames are a very useful tool whenever you have a table that is organized by rows and columns. A DataFrame has lots of built-in functions that make it easy to explore and manipulate. Let us look at what a DataFrame is and how we can use it to do astronomy research.

# 1. Defining DataFrame Using Arrays

You can define a pandas DataFrame by first making an empty DataFrame and then adding the column names and values one at a time to populate it.

In [None]:
# Making an empty DataFrame
DF_1d_arrays = pd.DataFrame()

In [None]:
print(DF_1d_arrays)

In [None]:
a = np.arange(0, 10, 1)
b = np.arange(150, 160, 1)
c = np.arange(200, 210, 1)

In [None]:
DF_1d_arrays['A'] = a
print(DF_1d_arrays)

In [None]:
DF_1d_arrays['B'] = b
print(DF_1d_arrays)

In [None]:
DF_1d_arrays['C'] = c
print(DF_1d_arrays)

# 2. Using Dictionaries

Dictionaries are a very useful type of container and a class of their own. They work on a key-value pair system: you give the dictionary a key to get the corresponding value. A key can be, for example, a number or a string, but it has to be unique. You cannot have multiple entries with the same key holding different values; if you assign to a key that already exists, the previous value is overwritten. Let's see an example of dictionaries in action and how we can convert them into pandas DataFrames.
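
The key-value behaviour described above can be seen in a few lines of plain Python. The galaxy entry below is made up for illustration.

```python
# made-up galaxy entry stored as key-value pairs
galaxy = {'name': 'Demo_Galaxy', 'RA': 150.1, 'DEC': 2.2}
first_ra = galaxy['RA']
print(first_ra)

# assigning to an existing key overwrites the old value
galaxy['RA'] = 151.0
print(galaxy['RA'])

# assigning to a new key adds a new entry
galaxy['redshift'] = 0.5
print(list(galaxy.keys()))
```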

In [None]:
DF_dictionary = {'A': a, 
                 'B': b, 
                 'C': c}

DF = pd.DataFrame(DF_dictionary)

In [None]:
DF

# 3. Using 2D-Arrays

In [None]:
two_d_array = np.random.uniform(low = -100, high = 100, size = (100, 3))
df_twod_arr = pd.DataFrame(two_d_array, columns= ['A', 'B', 'C'])

In [None]:
df_twod_arr

# DataFrames are Fancy Numpy Arrays

In [None]:
2 * DF

In [None]:
DF/3

In [None]:
DF + 100

In [None]:
DF/DF.loc[3]

# Accessing and Changing Data in DataFrame

A pandas DataFrame has two main ways to access data: the $\textbf{.loc}$ and $\textbf{.iloc}$ commands, and there are key differences between the two. In short, $\textbf{.loc}$ uses the index labels and column names to select the exact entries you want, whereas $\textbf{.iloc}$ uses integer positions in the rows and columns, similar to array indexing. The general syntax for accessing data is the following:

    DF.loc[[index1, index2, index3, ..., index_n], [Column1, Column2, ..., Column_m]]

    DF.iloc[[index_idx1, index_idx2, ..., index_idx_n], [col_index1, col_index2, ..., col_index_m]]
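
The difference between the two is easiest to see when the index labels are not simply 0, 1, 2. The sketch below uses a small made-up table with arbitrary index labels.

```python
import pandas as pd

# made-up table whose index labels are not 0, 1, 2
demo_df = pd.DataFrame({'RA': [15.67, 270.21, 50.00],
                        'DEC': [5.34, 76.21, 20.23]},
                       index=[145, 198, 43])

# .loc uses the index label and the column name
by_label = demo_df.loc[198, 'RA']

# .iloc uses integer positions: row 1 (second row), column 0 (first column)
by_position = demo_df.iloc[1, 0]

print(by_label, by_position)

# label 1 does not exist in this index, so .loc raises a KeyError
try:
    demo_df.loc[1, 'RA']
    label_found = True
except KeyError:
    label_found = False
print(label_found)
```

Both lookups return the same entry, but only because label 198 happens to sit at position 1.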

In [None]:
print(f"Value at row 4 and column B: {DF.loc[3, 'B']}")
print()
print(f"Value at row 4 and column A: {DF.loc[3, 'A']}")
print()
print('Values in Rows 1-5 in Column A:')
print(DF.loc[[0, 1, 2, 3, 4], 'A'])
print()
print('Values in rows 1-5 in Columns A, B, C:')
print(DF.loc[[0, 1, 2, 3, 4], ['A', 'B', 'C']])
print()

In [None]:
#Changing the 5th row and Column 1 and 3 to 5000
DF.loc[4, ['A', 'C']] = 5000

In [None]:
DF

In [None]:
DF.loc[[0, 1, 2, 3, 4]] = 100005

In [None]:
DF

In [None]:
#Quick Exercise: 
#you find that the data from row 8 and columns 1 and 2 are invalid due to 
#bad equipment; to account for this, change those values to the new value of -999

#INSERT CODE BELOW



In [None]:
#same output as above but using iloc
print(f"Value at row 4 and column B: {DF.iloc[3, 1]}")
print(f"Value at row 4 and column A: {DF.iloc[3, 0]}")
print()
print('Values in Rows 1-5 in Column A:')
print(DF.iloc[[0, 1, 2, 3, 4], 0])
print()
print('Values in rows 1-5 in Columns A, B, C:')
print(DF.iloc[[0, 1, 2, 3, 4], [0, 1, 2]])
print()

# Useful Pandas Features

In [None]:
hipparcos_df = pd.read_csv('NSF_REU_File.txt', sep = ' ',
                           index_col = 0)

In [None]:
# Useful pandas functions in the cells below

# Shows the first 5 entries of the DF, 
# useful to see what's the layout of the DF
hipparcos_df.head()

In [None]:
#simple description of the DF per column
hipparcos_df.describe()

In [None]:
#How many NaN values are in each column
hipparcos_df.isna().sum()

In [None]:
#Shows all the columns in the DF
hipparcos_df.columns

In [None]:
#Shows the index values of the DF
hipparcos_df.index

In [None]:
#accessing data using .loc
hipparcos_df.loc[[1, 1000, 11000], 
                 ['ra', 'dec', 'detectid', 'plya_classification']]

# Handling Multiple Data Frames

In this section we will cover some ways of handling multiple DataFrames. 

There are two main ways of merging two DataFrames: the *concat* and the *join* functions in pandas.

In [None]:
#generating Data
DF1 = pd.DataFrame({'Field': ['Aegis', "Aegis", 'COSMOS', 'UDS', 'GOODS-N', 'GOODS-N'], 
                    "RA": [15.67, 16.54, 270.21, 50.00, 100.23, 101.22], 
                    'DEC': [5.34, 6.54, 76.21, 20.23, 80.23, 79.22]}, 
                    index = [145, 112, 198, 43, 76, 31])

In [None]:
DF1

In [None]:
#generating Data 2
np.random.seed(101)
DF2 = pd.DataFrame({'f_g': np.random.normal(loc=5, scale = 2, size = 200), 
                    'e_g': np.abs(np.random.normal(0, scale = 1, size = 200)), 
                    'f_r': np.random.normal(loc=5, scale = 2, size = 200), 
                    'e_r': np.abs(np.random.normal(0, scale = 1, size = 200)), 
                    'f_i': np.random.normal(loc=5, scale = 2, size = 200), 
                    'e_i': np.abs(np.random.normal(0, scale = 1, size = 200)), 
                    'f_z': np.random.normal(loc=5, scale = 2, size = 200), 
                    'e_z': np.abs(np.random.normal(0, scale = 1, size = 200))})

In [None]:
#showing the first 5 rows and all the columns
DF2.head()

# Concat

The first function we will cover is *concat*, short for concatenate, which merges two DataFrames vertically. This is very useful when you have two DataFrames with the same columns, as it is an easy way of adding more data to a main DataFrame. Below we show what happens when we use concat on DataFrames with the same and with different column names.
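
As a compact sketch of both cases, the made-up tables below are small enough to inspect by eye. The ignore_index option at the end is an extra that the seminar cells do not use.

```python
import pandas as pd

# made-up tables: coords_a and coords_b share columns, fluxes does not
coords_a = pd.DataFrame({'RA': [15.67, 16.54], 'DEC': [5.34, 6.54]})
coords_b = pd.DataFrame({'RA': [270.21], 'DEC': [76.21]})
fluxes = pd.DataFrame({'f_g': [5.1, 4.8]})

# same columns: the rows of coords_b are stacked under coords_a
same_cols = pd.concat([coords_a, coords_b])
print(same_cols)

# different columns: every column is kept and the gaps are filled with NaN
diff_cols = pd.concat([coords_a, fluxes])
print(diff_cols)

# ignore_index = True renumbers the rows instead of repeating index labels
renumbered = pd.concat([coords_a, coords_b], ignore_index=True)
print(renumbered.index.tolist())
```

Without ignore_index the original index labels are kept, so the same label can appear more than once in the result.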

In [None]:
#using Concat on DF1 and DF2
concat_df = pd.concat([DF1, DF2])

In [None]:
concat_df

In [None]:
#look up index label 145 and see what we get
concat_df.loc[145]

In [None]:
#making a DF with similar column names to DF1
np.random.seed(43)
DF3 = pd.DataFrame({'Field': np.random.choice(DF1.Field.values, size = 100), 
                    'RA': np.random.uniform(low = 0, high = 360, size = 100), 
                    'DEC': np.random.uniform(low = -90, high = 90, size = 100)})

In [None]:
better_concat_df = pd.concat([DF1, DF3])

In [None]:
better_concat_df

# Join
The second function we will cover is the *join* function, which is very useful when you are merging one DataFrame into another that has similar indexes or keys. This is great when two DataFrames share the same indices/keys but have different column names, as join can merge the two into a larger DataFrame or down-select from a larger DataFrame only the data that you are after. In pandas you perform a *join* by calling it on the DataFrames themselves. If you have a DataFrame DF1 and you want to join DF2 to it, you do DF1.join(DF2), which matches the rows of DF2 to the index of DF1 and, by default, keeps every row of DF1.

If you want the reverse you would need to reverse the input, ie DF2.join(DF1). Let's see this in action below.
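
The effect of the join order can be sketched with two tiny made-up tables that only share some index labels. The how = 'inner' option at the end is an extra that the seminar cells do not use.

```python
import pandas as pd

# made-up tables: only index labels 1 and 2 appear in both
coords = pd.DataFrame({'RA': [15.67, 270.21, 50.00]}, index=[145, 1, 2])
fluxes = pd.DataFrame({'f_g': [5.1, 4.8, 6.0, 5.5]}, index=[0, 1, 2, 3])

# keeps the 3 rows of coords; label 145 has no match, so f_g is NaN there
coords_first = coords.join(fluxes)
print(coords_first)

# keeps the 4 rows of fluxes; labels 0 and 3 have no match, so RA is NaN there
fluxes_first = fluxes.join(coords)
print(fluxes_first)

# how = 'inner' keeps only the labels found in both tables
both = coords.join(fluxes, how='inner')
print(both)
```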

In [None]:
#Applying join matching DF2 to DF1
joined_DF2_to_DF1 = DF1.join(DF2)

In [None]:
joined_DF2_to_DF1

In [None]:
#Applying join to DF1 starting from DF2
joined_DF1_to_DF2 = DF2.join(DF1)

In [None]:
joined_DF1_to_DF2.head()

In [None]:
joined_DF1_to_DF2.loc[[31, 43, 76, 112, 145, 198, 10, 11, 12, 13]]

# File IO with Pandas

Pandas comes with loads of useful reading functions for all sorts of file formats, and some of them can be adjusted to read formats beyond the one in the function name. In the following section we go over the read_csv function and how to use it to read both csv files and text files, and we show how to save a file using the pandas to_csv function.
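
A minimal round trip, sketched here with a made-up table and a hypothetical file name in a temporary folder, shows to_csv and read_csv working together on a space-separated text file.

```python
import os
import tempfile
import pandas as pd

# made-up table and a hypothetical output file in a temporary folder
demo_df = pd.DataFrame({'Field': ['Aegis', 'COSMOS'],
                        'RA': [15.67, 270.21],
                        'DEC': [5.34, 76.21]},
                       index=[145, 198])
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_sample.txt')

# write the table as a space-separated text file, index included
demo_df.to_csv(demo_path, sep=' ')

# read it back, using the first column as the index again
demo_back = pd.read_csv(demo_path, sep=' ', index_col=0)
print(demo_back)
```

Passing index_col = 0 when reading matters here: without it the saved index would come back as an ordinary, unnamed data column.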

In [None]:
#Example 1
DF_FIELD_Coords_ex1 = pd.read_csv('Field_Coordinates.txt', sep = ' ')

In [None]:
DF_FIELD_Coords_ex1.head()

In [None]:
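#Example 2
#sep = r'\s+' splits on any run of whitespace; it replaces delim_whitespace=True, which is deprecated in recent pandas versions
DF_FIELD_Coords_ex2 = pd.read_csv('Field_Coordinates.txt', sep = r'\s+')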

In [None]:
DF_FIELD_Coords_ex2.head()

In [None]:
#Example 3
DF_Gal_Coords_ex3 = pd.read_csv('Galaxy_Coordinates.txt', sep = ' ')

In [None]:
DF_Gal_Coords_ex3.head()

In [None]:
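#Example 4
#sep = r'\s+' splits on any run of whitespace; it replaces delim_whitespace=True, which is deprecated in recent pandas versions
DF_Gal_Coords_ex4 = pd.read_csv('Galaxy_Coordinates.txt', sep = r'\s+')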

In [None]:
DF_Gal_Coords_ex4.head()

In [None]:
#Saving the data as a text file
joined_DF2_to_DF1.to_csv('Final_Sample.txt', sep = ' ')