## Data lesson 4

Today we will learn about reading and manipulating data.

In [None]:
# Add import statements
import numpy as np
import matplotlib.pyplot as plt

#### **Loading data**

So far, we have been using fake or manually entered data for our Python tasks.

Now we will learn how to read in data from a file, such as the output of an instrument in the lab.

A *text file* is any file that can be read by a human being in a text editor (as opposed to a binary file).

It can have different file extensions, like .txt or .csv or .dat.

Today we will introduce 3 common ways to read data from a file.

#### **Reading with python**

We can load data into python with the following steps:
* `open()` the text file
* `readline()` to read lines one at a time
* `close()` the file

We will demonstrate below with the file `practice_data.csv`. This file format is *comma-separated*: a new column is indicated by a comma.

*Open the file in a text editor first and look at its contents before we read it into Python.*

In [None]:
csvfile = open('practice_data.csv')
line = csvfile.readline()
line

In [None]:
line = csvfile.readline()
line

In [None]:
line = csvfile.readline()
line

We have now read the first three lines and printed them.

Notice that each time `readline()` is run, it automatically steps to the next line.
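
This stepping behavior can be seen in a minimal sketch that does not need a file on disk. Here `io.StringIO` stands in for an opened file, and the two lines of content are hypothetical:

```python
import io

# Hypothetical two-line file contents; io.StringIO stands in for open('some_file.csv')
fake_file = io.StringIO("1,31\n2,28\n")

first = fake_file.readline()    # reads the first line
second = fake_file.readline()   # reads the second line
third = fake_file.readline()    # nothing is left to read

print(repr(first))    # '1,31\n'
print(repr(second))   # '2,28\n'
print(repr(third))    # '' (an empty string marks the end of the file)

fake_file.close()
```

Once every line has been read, `readline()` returns an empty string rather than raising an error.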

The output for each line is a string of the comma-separated numbers.  That isn't very useful if we want to do any subsequent tasks with it.  There is also a *newline* character `\n` at the end of each line that we no longer need.

Let's convert a line to two actual numbers.  

Step 1 is to remove the `\n`.  The simplest way to do this is to `replace()` the character we are getting rid of with an empty string.

In [None]:
line_clean = line.replace('\n', '')
line_clean

Step 2 is to break the string apart at the comma.  We use `.split()` to specify at which character to break the string apart.

In [None]:
line_split = line_clean.split(',')
line_split

Step 3 is to turn the resulting strings into numbers.  In this case we use integers, but in other cases you might want to use a float.

In [None]:
num_1 = int(line_split[0])
num_2 = int(line_split[1])

print(num_1)
print(num_2)

Once we are done with a file we have opened, it is best practice to close it.

In [None]:
csvfile.close()

With this approach we read lines in one at a time and manually turn them into numbers.
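
The same three steps work for decimal values if we swap `int()` for `float()`. The sketch below assumes a hypothetical line of text, written the way `readline()` would return it:

```python
# Hypothetical line, as readline() might return it from a file of decimal values
line = "2.5,10.75\n"

line_clean = line.replace('\n', '')   # Step 1: remove the newline character
line_split = line_clean.split(',')    # Step 2: split into ['2.5', '10.75']

value_1 = float(line_split[0])        # Step 3: float() accepts a decimal point
value_2 = float(line_split[1])

print(value_1 + value_2)              # 13.25
```

Calling `int('2.5')` would raise a `ValueError`, which is why the choice between `int()` and `float()` depends on what the file contains.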

Below we use a *loop* to do this more efficiently.  We will learn more about these next week!

In the meantime, recognize that the code below is opening a file, and then repeating the actions we walked through above so that the first number in each row is stored in one list, and the second number in another list.

In [None]:
# create empty lists
months = []
days = []

csvfile = open('practice_data.csv')      # open file
for line in csvfile.readlines():
    line_clean = line.replace('\n', '')  # remove newline character
    line_split = line_clean.split(',')   # split into 2 strings around the comma
    month = int(line_split[0])           # turn into integer
    day = int(line_split[1])

    months.append(month)                 # add to list
    days.append(day)
csvfile.close()                          # close file

print(months) 
print(days)

The benefit of the line-by-line approach is that it is very customizable, while the drawback is that it is very verbose.
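
As one illustration of that customizability, the sketch below skips a header row and a blank line. The file contents are hypothetical (our practice file may not contain either), and `io.StringIO` stands in for an opened file:

```python
import io

# Hypothetical file contents with a header row and a blank line
fake_file = io.StringIO("month,days\n1,31\n\n2,28\n")

months = []
days = []

lines = fake_file.readlines()                 # read every line into a list
for line in lines[1:]:                        # start at index 1 to skip the header
    line_clean = line.replace('\n', '')       # remove newline character
    if line_clean != '':                      # ignore blank lines
        line_split = line_clean.split(',')    # split around the comma
        months.append(int(line_split[0]))
        days.append(int(line_split[1]))
fake_file.close()

print(months)   # [1, 2]
print(days)     # [31, 28]
```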

We can also use functions loaded from modules to assist us in reading data, which we will discuss next.

#### **CSV reader**

The `csv` module can be useful when you are reading in .csv data.  

*Import the `csv` module below.*

In [None]:
# Add code here


As in our last example, we will start by creating empty lists to store the data we are reading in.

We will then open the .csv file like before.

This time, we will create a reader object with `csv.reader()` and store it in a variable. Like before, this allows us to step through rows one by one.

Unlike before, this will automatically separate the columns within each row.

In [None]:
months = []
days = []

csvfile = open('practice_data.csv') # open the file

data = csv.reader(csvfile)          # create the csv reader

for row in data:                    # step through
    months.append(int(row[0]))      # Turn first column into integer & add to list
    days.append(int(row[1]))        # Turn second column into integer & add to list

csvfile.close()                     # close the file

print(months)
print(days)

We see that we've ended up with the same result, but with somewhat less hands-on work than reading line by line with Python's built-in file methods.

You can also imagine that this would be an efficient way to access specific columns if there are many columns in your .csv file.
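
A minimal sketch of that idea is shown below. It assumes a hypothetical four-column file, with `io.StringIO` standing in for the opened file, and keeps only the third column:

```python
import csv
import io

# Hypothetical four-column data; io.StringIO stands in for an opened .csv file
fake_file = io.StringIO("1,31,0.5,10\n2,28,0.7,20\n3,31,0.9,30\n")

third_column = []

reader = csv.reader(fake_file)             # create the csv reader
for row in reader:                         # each row is a list of strings
    third_column.append(float(row[2]))     # index 2 is the third column

fake_file.close()

print(third_column)   # [0.5, 0.7, 0.9]
```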

#### **Reading with NumPy**

NumPy contains functions that convert a text file directly into an array.  

This is very convenient, though it offers somewhat less flexibility than the other methods.

Today we will use `np.genfromtxt()`. We will pass it two arguments: the file name, and a string representing the `delimiter` (in this case a comma).
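
To show how the `delimiter` argument works, here is a sketch with hypothetical data that uses a semicolon instead of a comma. `io.StringIO` stands in for a file name here, since `np.genfromtxt()` also accepts file-like objects:

```python
import io
import numpy as np

# Hypothetical semicolon-separated data; io.StringIO stands in for a file name
fake_file = io.StringIO("1.0;10.0\n2.0;20.0\n3.0;30.0\n")

example = np.genfromtxt(fake_file, delimiter=';')

print(example)   # a 2D array with 3 rows and 2 columns of floats
```

Note that the values come back as floats by default, even when the file contains whole numbers.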

*Try loading the `practice_data.csv` file with `np.genfromtxt()`.*

In [None]:
# Fill in the file name and the delimiter
data = np.genfromtxt( , delimiter= )

*What are the number of dimensions, the shape, and the size of your new array?*

In [None]:
# Add code here


*Print out the number of days in the month of July.*

In [None]:
# Add code here


One useful trick is to *transpose* the array, which means that all of the values within one column are now in one row.  Because of the way that array indexing works, this can make it more convenient to plot the contents of spreadsheet-like data.

We do this with the syntax `arrayname.T`
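
As a small sketch of what the transpose does, consider a hypothetical array with 3 rows and 2 columns (the name `example` and its values are made up for illustration):

```python
import numpy as np

# Hypothetical 3-row, 2-column array (each row is one record)
example = np.array([[1, 10],
                    [2, 20],
                    [3, 30]])

example_T = example.T

print(example.shape)     # (3, 2)
print(example_T.shape)   # (2, 3)
print(example_T[0])      # first column of the original: [1 2 3]
print(example_T[1])      # second column of the original: [10 20 30]
```

After transposing, a single index such as `example_T[0]` returns a whole column of the original data.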

In [None]:
data_T = data.T 
data_T

*Now make a plot of the number of days per month*

In [None]:
# Add code here

#### **Manipulating data**

Let's imagine we have read in some data from an instrument in the lab.

We would usually call this "raw" data: we need to take some steps to put it into a final form that we can report measurements with.

We will use an example IR spectral analysis to walk through some common operations you might need to perform on raw data.

*Open the file `spectrum1.txt`. What is the delimiter for this file?  What quantities are stored in the first and second column?*

*Now use `np.genfromtxt()` to read in `spectrum1.txt`.*

In [None]:
# Add code here
spec_data = np.genfromtxt( , delimiter= )

*For convenience, let's make one-dimensional arrays to store the x data (wavenumber) and y data (absorbance).*

In [None]:
# Assign the correct data to these variables
wavenumber = 
absorbance = 

It is often helpful to first plot your raw data to see what you are working with.  

*Make a plot of absorbance (y) versus wavenumber (x).*

In [None]:
# Create plot here


We are interested in determining the peak absorbance of the band around 15 microns.  Here are the steps we will take to get there:
* Convert the x-axis units from wavenumber (cm^-1) to wavelength (um)
* Subtract a baseline so that the absorbance is 0 in regions where there are no absorption peaks
* Zoom in on the wavelength range around the band of interest
* Find the maximum absorbance in that range
* Find the wavelength of maximum absorbance
* Bonus: integrate the area under the band
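
The first step in the list above relies on two facts: wavelength is 1/wavenumber, and 1 cm = 10,000 um. As a quick sanity check, here is the arithmetic for a single hypothetical wavenumber (the `example_` names are illustrative and not part of the exercise):

```python
# Hypothetical single value, chosen so the arithmetic is easy to check by hand
example_wavenumber = 2000.0              # cm^-1

example_cm = 1 / example_wavenumber      # wavelength in cm (0.0005 cm)
example_um = example_cm * 1e4            # 1 cm = 10,000 um

print(round(example_um, 6))              # about 5 um
```

A wavenumber of 2000 cm^-1 therefore corresponds to a wavelength of about 5 um, and larger wavenumbers correspond to shorter wavelengths.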

*Use array math to calculate the wavelength in micrometers for each wavenumber. Remember that wavenumber is 1/wavelength.*

In [None]:
# Add code here
wavelength_cm = 
wavelength_um = 

*Make a plot of absorbance (y) versus wavelength (x), zooming in on the region of the spectrum that contains just the peak of interest (~15 um) along with some baseline on either side.*

In [None]:
# Add code here


Now we want to find the indices of our arrays that correspond to this part of the spectrum.  We will use `np.logical_and()` to do this.

We will learn more about Boolean logic next week.  In the meantime, you can recognize that the function takes two arguments: the first argument filters for wavelengths above some value, and the second argument filters for wavelengths below a second value.

The resulting `idx` is a Boolean array that is `True` wherever both of those conditions are met.
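
A minimal sketch with a small, hypothetical array shows what each piece looks like (the names `x`, `above`, `below`, and `both` are illustrative):

```python
import numpy as np

# Hypothetical x values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

above = x > 2.0                       # True where x is above the lower limit
below = x < 5.0                       # True where x is below the upper limit
both = np.logical_and(above, below)   # True only where both conditions are True

print(above)   # True for 3.0, 4.0, 5.0, and 6.0
print(below)   # True for 1.0, 2.0, 3.0, and 4.0
print(both)    # True only for 3.0 and 4.0
```

The result has the same length as `x`, with one `True` or `False` per element.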

*Fill in the missing numbers to select part of the spectrum within some lower & upper wavelength limit.*

In [None]:
# Fill in missing numbers here
idx = np.logical_and(wavelength_um> , wavelength_um< ) 

We now want to make new variables to store the wavelength and absorbance values only within the specified range of wavelengths.

*Use `[idx]` instead of a number in brackets to select only the elements of interest from your wavelength and absorbance arrays.  Store these as new variables.*

In [None]:
# Add code here
wavelength_zoom = 
absorbance_zoom = 

*Subtract a constant number from your zoomed absorbance array such that the value falls to 0 in parts of the spectrum without an absorption peak.*

In [None]:
# Add code here
absorbance_baselined = 

Now we want to find the maximum absorbance value within this wavelength range.  

*Find the maximum value of the baseline-subtracted absorbance array.*

In [None]:
# Add code here


Note that if we had taken the maximum of the original absorbance array, it would have corresponded to the peak around 4.3 microns and not our peak of interest at 15 microns.

The next step is to find the wavelength value where the peak absorbance occurs.

*Find the index corresponding to the peak absorbance value.  Then, print the wavelength at that same index.*

In [None]:
# Add code here

Be careful not to mix indices from your zoomed-in arrays with indices from your original arrays!
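
The sketch below illustrates this pitfall with a hypothetical spectrum that has a tall peak and a smaller peak. It assumes `np.argmax()` as one way to find the index of the maximum; all of the data values are made up:

```python
import numpy as np

# Hypothetical spectrum: a tall peak at x = 2 and a smaller peak at x = 6
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([0.1, 0.9, 0.1, 0.0, 0.2, 0.5, 0.2])

idx = np.logical_and(x > 4.5, x < 7.5)   # keep only the region around the small peak
x_zoom = x[idx]
y_zoom = y[idx]

i_peak = np.argmax(y_zoom)   # position of the maximum within the zoomed arrays

print(i_peak)                # 1
print(x_zoom[i_peak])        # 6.0, the correct peak position
print(x[i_peak])             # 2.0, wrong: a zoomed index was used on the full array
```

Index 1 means "second element" in both arrays, but the second element of `x_zoom` is not the second element of `x`.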

With all of this information, you could perform a Beer's Law analysis to solve for a concentration based on the measured absorbance at a wavelength of 15.24 um.

Sometimes it is also necessary to integrate the area under a curve. We will practice that now using `np.trapz()`, which uses the trapezoid rule. (In NumPy 2.0 and later, this function is named `np.trapezoid()`.)

There are many other functions for numerical integration in python.  In particular, the `scipy` library contains many other options and is used commonly in scientific computing.

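`np.trapz()` takes the y values followed by the x values. Only the y values are strictly required, but if you leave out the x values the function assumes a spacing of 1 between points. Note that this is the opposite of the order used in plotting!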
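
As a check that the argument order is right, the sketch below integrates a hypothetical straight line whose area is easy to work out by hand. The first line of code is an assumption about your setup: it picks whichever of the two function names your NumPy version provides.

```python
import numpy as np

# np.trapezoid is the NumPy 2.0+ name; fall back to np.trapz on older versions
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Hypothetical data: the straight line y = x from x = 0 to x = 2
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = x

area = trapezoid(y, x)   # y values first, then x values
print(area)              # 2.0, the area of a triangle with base 2 and height 2
```

The trapezoid rule is exact for a straight line, so the result matches the hand calculation of (1/2) x base x height.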

*Calculate the area under the peak in your zoomed spectrum.*

In [None]:
# Add code here

**To summarize:**
Today we learned how to read in data using multiple methods and how to handle multiple text file formats.

We practiced using array operations to perform simple conversions and calculations with our measured x and y data.

We learned how to identify and extract specific ranges of interest within the arrays.

We learned how to numerically integrate the area under a curve.

Next class, we will learn about how to fit models to data.


**Try it on your own:**

If there is extra time, try performing the same analysis fully on your own for the peak around 3 microns in `spectrum2.txt`.