## EPS/ESE 135: Observing the Ocean
### Data Analysis Assignment 1: Intro to plotting data with Python

### Background

Throughout the semester, you will learn about many different types of oceanographic data, and your final project will involve working with a group to analyze and present an oceanographic data set. This assignment is your first opportunity to practice manipulating and visualizing oceanographic data using Python.

#### Working with Jupyter notebooks

This is a Jupyter Notebook, a Python IDE (integrated development environment) that is based around "cells," which can contain snippets of code or text. Jupyter notebooks are a versatile tool that can be useful for learning how code works, exploring a data set, or describing analysis steps and results using integrated Markdown text cells like this one, which also makes them good for tutorials.

You should be able to download a copy of this file (`01_python_intro.ipynb`) and run it on your own machine to replicate and build on the results shown.

There are a few ways to create new cells, including by clicking the + sign at the top of the notebook, or by using one of the + icons inside the active cell to duplicate the current cell or create a new one above or below.

When you create a new cell, make sure that Jupyter interprets it correctly by selecting either "Markdown" (rich text), "raw" (plain text), or "code" from the dropdown menu at the top. Then you can execute the cell, which will format the text or run the code, by clicking the single forward arrow at the top or by pressing command + enter (on a Mac).

There are many online resources with more explanation, keyboard shortcuts, Markdown tips, etc. If you're looking for more information, the [Project Pythia Jupyter tutorial](https://foundations.projectpythia.org/foundations/getting-started-jupyter/) is a good place to start.

#### The kernel

When you run Python code in a Jupyter notebook, the code is executed by a "kernel," which is basically a contained computational environment for your calculations. Because Python is modular, the first thing you have to do is import the libraries that you will call in your code. Those libraries need to be installed in the Python environment you are working with (e.g. our `eps135` environment) so that you can import them.

When you create or manipulate variables these are also stored in the kernel until it is reset. You can restart the kernel using the _Kernel_ menu or by clicking on the circular arrow at the top of the notebook. If you do this (or if you close and reopen the file), you will need to rerun all the code, including the first step of importing your code libraries.

### Running Python code in a Jupyter notebook

Please run each cell containing Python code by clicking the single forward arrow at the top of the notebook or by pressing shift + enter (this shortcut works on any operating system and also moves you to the next cell). Make sure that the output (if any) makes sense before moving on to the next cell. (Not all code will generate output.)

#### Importing Python packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime, timedelta

By using the `import ... as` syntax, we've created an alias for each of these packages. There's nothing stopping you from calling them something else, but these aliases are very widely used, and it will be much easier to share code if you stick to these conventions.

#### Basic calculations

For starters, you can use Python as an overpowered calculator. Basic mathematical operations are built in.

In [None]:
12 * 87 + 22 / 7

In [None]:
10 ** 2 #you can also add in-line comments like this

This starts to get more useful when you save variables. We will create a simple numpy array from a list of integers --- to learn more about numpy, check out this [numpy primer from Project Pythia](https://foundations.projectpythia.org/core/numpy/numpy-basics/).

In [None]:
a = np.array([12, 17, 25, 20, 19, 20, 16, 15, 21, 22]) # creates a numpy array
print(a) # displays the value of variable a

Now variable `a` is stored in the kernel. If you want to refer only to certain values of `a`, you can index into the array using square brackets. In Python's indexing system, the first value has index 0, and a colon separates the start and stop of a range. The stop index is not included, so `a[0:3]` returns the values at indices 0, 1, and 2, i.e. the first 3 values of the array. You can also skip the zero and simply index `a[:3]` for the same result.

To get the last `n` values, you can do negative indexing, i.e. `a[-n:]`:

In [None]:
print(a[0:3]) # selects the first 3 values of a
print(a[:3]) # also selects the first 3 values of a
print(a[-4:]) # selects the last 4 values of a
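To make the slicing rules more concrete, here is a short additional sketch (not one of the original assignment cells). It uses the same array `a`; the variable names `middle`, `every_other`, and `last` are just illustrative. The key point is that `a[start:stop]` includes `start` but not `stop`, so the number of values returned is `stop - start`.

```python
import numpy as np

a = np.array([12, 17, 25, 20, 19, 20, 16, 15, 21, 22])

middle = a[2:6]       # indices 2, 3, 4, 5 (index 6 is not included)
every_other = a[::2]  # a third number sets the step: every second value
last = a[-1]          # a single negative index counts back from the end

print(middle, len(middle))
print(every_other)
print(last)
```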

Now we can do all kinds of manipulations and calculations with the saved variable.

In [None]:
# Basic mathematical operations:
print(a[2:6]*2) # take a subset of values of a and multiply them by 2

# Statistical calculations:
print(np.min(a)) # minimum value
print(np.mean(a)) # mean value
print(np.std(a)) # standard deviation

This list of values is slightly confusing. You can include text strings in your `print` commands within single or double quotes that will make the outputs more intelligible. When you are working with text strings rather than numbers, the `+` sign concatenates strings together.

Let's say that the values in `a` are temperatures in degrees Celsius. It's important always to be explicit about the units of values that you are reporting.

In [None]:
# save the mean value of a as new variable:
a_mean = np.mean(a)

# convert a_mean from a numeric variable to a string, which python will interpret as text:
a_mean_string = str(a_mean)

# use + signs to combine strings into an easily readable output:
print('The mean temperature is '+a_mean_string+' degrees C.')
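As an aside, Python also has "f-strings," which are an alternative to combining strings with `+`. The sketch below is an optional extra rather than part of the assignment: it assumes the same array `a`, and the names `a_std` and `message` are introduced here for illustration. Inside the curly braces, `:.1f` rounds the printed value to one decimal place.

```python
import numpy as np

a = np.array([12, 17, 25, 20, 19, 20, 16, 15, 21, 22])
a_mean = np.mean(a)
a_std = np.std(a)

# the f before the opening quote lets us place variables inside {curly braces}
message = f'The mean temperature is {a_mean:.1f} degrees C.'
print(message)
print(f'The standard deviation is {a_std:.2f} degrees C.')
```

Either approach is fine for this assignment; the f-string version mainly saves you the separate `str()` conversion step.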

We can define a function that will convert the Celsius temperatures in `a` to Fahrenheit. (You don't need to define a function to do a calculation once, but if you plan to do it many times, this can streamline your code and help avoid errors.)

In [None]:
# this is the syntax to define a function:
def c_to_f(temp_c): # function name: c_to_f; input argument: temp_c
    temp_f = temp_c * 9/5 + 32 # perform calculation on input variable temp_c and save result as temp_f
    return temp_f # return output variable temp_f

Running the above cell will save the new function to the kernel but won't give any output. However, now that the function has been created, we can try it on our variable `a`:

In [None]:
temp_f = c_to_f(a)
print(temp_f)
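Whenever you write a function, it is a good habit to try it on inputs where you already know the answer. The following is a minimal sketch of such a check, repeating the definition of `c_to_f` so that the cell stands on its own. The reference points (freezing and boiling of water, and the fact that -40 is the same on both scales) are standard values, not data from this assignment.

```python
import numpy as np

def c_to_f(temp_c):
    temp_f = temp_c * 9/5 + 32
    return temp_f

print(c_to_f(0))     # freezing point of water, should be 32 degrees F
print(c_to_f(100))   # boiling point of water, should be 212 degrees F
print(c_to_f(-40))   # -40 is the same temperature on both scales

# the same function also works element by element on a numpy array
print(c_to_f(np.array([0, 100])))
```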

### Please install Python on your laptop and run cells up to this point before class on Thursday, Sept 11.

#### Exercise 1:
As you are doing these exercises, remember to comment your code inline throughout using the `#` symbol.

1. Calculate the mean of the variable `temp_f` and save this as a new variable. Give it a simple, descriptive name such as `temp_f_mean`.
2. Use `print` to display the output as a sentence.
3. Following the above example, define a function called `f_to_c` that takes the temperature in Fahrenheit as its input and returns the Celsius temperature.
4. To test that your function works as expected, use your new function to convert `temp_f_mean` into degrees Celsius. `print` the output and compare it to the result that we got from computing the mean of the original variable `a`. Did you get the result you expect?

In [None]:
# use this empty code cell for exercise 1. You may create additional cells!

#### Plotting data

[Matplotlib](https://matplotlib.org/) is a powerful tool for visualizing data with Python. It is also quite well-documented and there are many helpful examples on their website. Again, [Project Pythia's tutorial](https://foundations.projectpythia.org/core/matplotlib/matplotlib-basics/) is a great starting point if you'd like to learn more.

At its most basic, though, we can create a line plot by calling [plt.plot(___)](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html), which will be shown inline by Jupyter.

In [None]:
plt.plot(a)

Now we can see what the data look like, but if we want to be able to interpret them, we need to add some context to the plot. Let's say these values are the daily high temperatures in Boston over 10 days (maybe in April when the weather is very chaotic). We need to create a time variable to plot along the x-axis, and add axis labels so the viewer knows how to read the plot.

In [None]:
# define time variable:
time = np.arange(1,11) # this creates an array of integers from 1 to 10

# plot data:
plt.plot(time, a) # plots time on the x axis and a (temp) on the y axis

# add axis labels: 
plt.xlabel('day')
plt.ylabel('temperature [deg C]')

# add plot title -- you should do this as you work through exercises, but if you are including plots
# in written work like a lab report or manuscript, you should remove the title and put that information
# in the figure caption
plt.title('Daily high temperature')

There are lots of other choices you can make in creating your plots once you've covered these basics, such as changing colors or adding markers and grid lines, which are covered in the tutorial and documentation linked above. You can also plot multiple data sets or variables on the same axes by calling the plot command more than once, or by using other matplotlib functions like [plt.axhline(___)](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axhline.html), which will plot a horizontal line at a chosen value.

In [None]:
# plot data:
# the additional arguments specify line color, add a circle marker,
# and tell matplotlib what label to give this line in the legend
plt.plot(time, a, color='green', marker='o',label='daily high')

# plot a horizontal line at the mean value:
plt.axhline(a_mean, color='orange', label='mean')

# add axis labels and title: 
plt.xlabel('day')
plt.ylabel('temperature [deg C]')
plt.title('Daily high temperature')

# add grid lines:
# this function doesn't need arguments, but we still need to give it
# the empty parentheses
plt.grid() 

# add legend:
# the labels we specified earlier will show up in the legend and it 
# will automatically try to put it somewhere on the axis where it's not
# covering up our data
plt.legend()

#### Loading and plotting timeseries data with pandas

Now we can try to visualize some actual oceanographic data. This data set is longer than 10 data points, so we'll read it in from a csv file. CSV stands for "comma separated values," and if you take a look at `timeseries.csv` in a text editor, you will see why.

In [None]:
cf3 = pd.read_csv('timeseries.csv')
cf3

Working with dates and times can be non-trivial, as there are so many ways to format them. In this case, the time variable is a MATLAB-style date number, i.e. a count of days (including fractions of a day) starting from year 0000, so the first step will be to convert the time variable to a data type called "datetime" that both pandas and matplotlib can use more sensibly.

In [None]:
# in this cell we define a function called convert_series that will do this date format conversion.
# it relies on a helper, matlab_datenum_to_datetime, which converts one value at a time.
# full disclosure: I used ChatGPT to streamline this function.
def matlab_datenum_to_datetime(dn):
    """
    Convert a single MATLAB datenum (float) to Python datetime.
    """
    return datetime.fromordinal(int(dn)) + timedelta(days=dn % 1) - timedelta(days=366)

def convert_series(dn_series):
    """
    Convert a Pandas Series of MATLAB datenum values to datetime.
    """
    return dn_series.apply(matlab_datenum_to_datetime)

In [None]:
# use the newly defined function convert_series to replace the original time variable with datetime

cf3.time = convert_series(cf3.time)
cf3

Okay, you can see that we now have intelligible timestamps for our data! Pandas has many other powerful options for manipulating data using this time variable as an index, some of which are described in the [Project Pythia pandas tutorial](https://foundations.projectpythia.org/core/pandas/pandas/).
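If you would like to convince yourself that the conversion is doing the right thing, one option is to test it on a date number whose calendar date is known. The sketch below is an optional check, not part of the assignment. It repeats the helper function from above and relies on two standard reference values: MATLAB date number 730486 is 1 January 2000, and 719529 is 1 January 1970. The second half shows a vectorized alternative using `pd.to_datetime`, which is my assumption about how one might do this in pandas and is not used elsewhere in this notebook.

```python
from datetime import datetime, timedelta
import pandas as pd

def matlab_datenum_to_datetime(dn):
    return datetime.fromordinal(int(dn)) + timedelta(days=dn % 1) - timedelta(days=366)

# 730486 is 1 January 2000, so 730486.5 should come out as noon on that day
check = matlab_datenum_to_datetime(730486.5)
print(check)

# alternative: subtract the date number of 1 January 1970 (719529),
# then let pandas interpret the result as days since that date
dn = pd.Series([730486.0, 730486.5])
converted = pd.to_datetime(dn - 719529, unit='D')
print(converted)
```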

This data set is a temperature and pressure record from a mooring in the North Atlantic Ocean off the southeast coast of Greenland, which is part of the [OSNAP array](https://www.o-snap.org/). The variables are:
* temperature [degrees Celsius]
* pressure [dbar (decibars)]
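
Before plotting, it can help to check which columns a dataframe contains and how to pull one of them out. Here is a minimal sketch using a tiny made-up table: the dataframe `demo` and all of its values are hypothetical and are not taken from `timeseries.csv`. Only the column names `temp` and `pres` are chosen to resemble the ones used in this notebook.

```python
import pandas as pd

# hypothetical values for illustration only
demo = pd.DataFrame({'temp': [4.1, 4.3, 4.0, 3.9],
                     'pres': [101.2, 100.8, 101.5, 101.1]})

print(demo.columns.tolist())  # list the column names
print(demo.temp)              # dot indexing selects one column
print(demo['temp'])           # bracket indexing selects the same column
print(demo.temp[:2])          # square brackets then select the first 2 values
```

Dot indexing is convenient for simple column names, while the bracket form also works for names that contain spaces or clash with existing dataframe methods.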

#### Exercise 2:
__Now you will create some plots of the data in the `cf3` dataframe.__

__Plot 1/3:__
* Plot temperature (y-axis) versus time (x-axis). Add x- and y-axis labels and a simple title.

(Hint: To work with variables in a pandas dataframe, use dot indexing as we did above, e.g.
`plt.plot(cf3.time,cf3.temp)`.)

In [None]:
# create Plot 1 here

__Plot 2/3:__ 
* Calculate the mean pressure value and use `print` to display the value in a complete sentence (including units!).
* Plot pressure versus time and add a horizontal line with the mean pressure value to your plot. Use the `label='___'` argument in each plot command.
* Add x- and y-axis labels, a legend, and a title.

In [None]:
# create Plot 2 here

__Plot 3/3:__ 

Let's look a little more closely at that pressure timeseries to understand the variations around the mean. 
1. Look back at the converted timestamps above. What is the time interval between consecutive measurements (rounded to the nearest minute...you can ignore the excessive decimal places)? How many measurements are recorded per hour, and how many per day?
2. Now plot pressure versus time again, but only for the first 30 days of the time series. To do this, you can index into the time and pressure variables using square brackets as shown in the first section of this notebook, e.g. `cf3.time[:___]`. (It's okay if the x-axis labels are overlapping a little, we'll fix this another time.) Add axis labels and title.
3. In one or two sentences, what do you think you might be seeing here?

In [None]:
# create Plot 3 here

#### Loading and plotting profile data

In a moored record, we observe how a variable (such as temperature) is changing over time at a fixed location and depth (well, that's the goal, which may not always work exactly as planned...). In that case, time is our independent variable, plotted on the x-axis, and temperature is the dependent variable, plotted on the y-axis.

In the ocean, however, we also frequently collect profile data by lowering a sensor through the water column and observing how a variable such as temperature changes with depth. As you have heard, the instrument that we use to do this is called a CTD (conductivity, temperature and depth -- where depth is generally analogous to pressure in the ocean).

When plotting a CTD profile, our independent variable is therefore depth (or pressure), and because it is a vertical measurement, we plot the independent variable on the y-axis. This can be confusing at first if you're not used to looking at plots oriented this way, but will become more intuitive.

I've given you another csv file, this time containing a CTD profile from the southeast Greenland continental shelf. The variables are:
* pressure [dbar (decibars)]
* temperature [degrees C]
* salinity [psu (practical salinity units)]

In [None]:
ctd = pd.read_csv('ctd_raw.csv')
ctd

#### Exercise 3:
__Plot 1/2:__
1. Plot temperature (x-axis) versus pressure (y-axis). Where is the surface (pressure = 0 dbar) in your plot? Where is the ocean floor (highest pressure)? Does this make sense to you?
2. Set the y-axis limits for your plot using this command: `plt.ylim(bottomvalue,topvalue)`. Let the bottom y-value be +500 and the top y-value be -5. Add axis labels and a title to your plot.
3. Describe how the temperature changes with depth, beginning from the surface. Does anything about this surprise you?

In [None]:
# create Plot 1 here

#### Creating figures with subplots

Jupyter makes it easy to create basic visualizations with the Matplotlib commands we've used so far. Sometimes, though, you will want more control over different aspects of your plots. You can get this by creating figure and axes "objects" with the command `fig,ax = plt.subplots()`. If you leave the parentheses empty, this will simply create one set of axes like our earlier plots, but you will be able to use the `ax` handle to manipulate them.

One simple but useful thing this allows you to do is create figures with subplots so that you can look at related data side by side, by including values in the parentheses using this syntax: `fig,ax = plt.subplots(rows,cols)`.
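
To illustrate the general pattern, here is a small sketch with made-up data (sine and cosine curves, which have nothing to do with the CTD profile). It stacks two panels vertically, and the comments point out the `plt` command that each `ax` method corresponds to. Treat it as one possible layout rather than a template you must follow.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up data for illustration
x = np.linspace(0, 2 * np.pi, 50)

# 2 rows, 1 column: ax[0] is the top panel, ax[1] is the bottom panel
fig, ax = plt.subplots(2, 1)

ax[0].plot(x, np.sin(x))
ax[0].set_ylabel('sin(x)')    # with plt this would be plt.ylabel(...)
ax[0].set_title('Top panel')  # with plt this would be plt.title(...)

ax[1].plot(x, np.cos(x), color='orange')
ax[1].set_xlabel('x')         # with plt this would be plt.xlabel(...)
ax[1].set_ylabel('cos(x)')

fig.tight_layout()            # reduces overlap between the panels
```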


__Plot 2/2:__

I have given you the command to generate 1 row of 2 subplots. This creates two "axes objects", `ax[0]` and `ax[1]`. I have then shown you how to plot and set the y-axis limits on the left-hand subplot.

You might notice these have a subtly different syntax than the `plt.___` commands in some cases, e.g. we used `plt.ylim(___)` before and here we have to use `ax[0].set_ylim(___)`. If you run into errors, this is a good thing to check.

1. Finish the temperature profile by setting appropriate y-axis limits and adding axis labels.
2. Plot the salinity data in the right-hand subplot `ax[1]`, then add axis labels and set the y-axis limits as you did for temperature.
3. Describe how salinity changes with depth. Does anything about this surprise you?

In [None]:
# create Plot 2 here

# create fig and ax objects:
fig,ax = plt.subplots(1,2) # plt.subplots(rows,cols)

# plot temperature profile on left
ax[0].plot(ctd.temp,ctd.pres)
#ax[0].set_ylim(...
#ax[0].set_xlabel(...

# plot salinity profile on right
#ax[1].plot(___)

All done! If you have any questions or feedback, feel free to add more cells after this with your thoughts about what was confusing, what could be improved, or what you'd like to learn more about... :)

__Then please print this page as a PDF and submit it on the Canvas assignment by September 16.__