# <center> MAS 332L Lab 2: Introduction to Python II </center>

In this class, we will be using Python to manipulate data, clean up data, and plot multiple variables on a single graph.
 
---

## Lesson Objectives

* You will learn how to:
    * Clean up messy data
    * Format and append dates that are easily plotted
    * Create time series plots with formatted axes and legends
 
  
---


# Before starting:

#### Saving Progress

One of the issues we ran into last class revolved around saving your progress. Clicking the "Save and Checkpoint" button saves your progress locally, so you can reopen your notebook in Jupyter at the same point. If you do not understand how this works, please ask one of us before starting this assignment.

#### Assignment Details

We are asking you to submit both a copy of the notebook and images of the figures you will generate throughout the lesson. 

#### Commenting Code

As we progress through our Python assignments, it will become increasingly important to comment your code. Comments make it easier to revisit your code later, and easier for someone else to look it over and help.

#### Please acknowledge that you understand the instructions by copying and pasting each of the following into the next cells.

#I understand how to save my progress and reopen the notebook.

#I understand that I am being asked to save and submit copies of my notebook as well as figures where noted in the assignment.

#I understand that I need to comment my code. 


### Read in the Data File

Recall from the last lab that the Pandas package has a function for reading text/ASCII data called `read_csv()`. Although the name suggests it only reads CSV files, `read_csv()` will read *any* delimited data if you set the `sep` keyword argument to the delimiter. Below, you will import the [Pandas](https://pandas.pydata.org/) package and read in the data file that you downloaded in the previous step. Note that the path below is relative to the current notebook, so you may have to change the code if you are running in a different environment.
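
As a small illustration of the `sep` argument, the sketch below reads the same two rows from comma-, tab-, and semicolon-delimited text. The values are made up for the example, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up sample text; StringIO lets read_csv treat a string like a file
comma_text = "yyyy,mm,temp\n2015,1,14.2\n2015,2,15.1\n"
tab_text = comma_text.replace(",", "\t")
semicolon_text = comma_text.replace(",", ";")

df_comma = pd.read_csv(io.StringIO(comma_text), sep=",")
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# All three calls produce the same table
print(df_comma.equals(df_tab) and df_comma.equals(df_semicolon))   # True
```

Only the `sep` value changes between the three calls; the rest of the code stays the same.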


Now let's import Pandas so that we can read the hydrography data file we uploaded into our Jupyter Notebooks directory.

In [None]:
#Import Pandas and alias it as 'pd'
import pandas as pd

In [None]:
#Defining the filename allows you to write code that can be reused with different files
filename = "MAS_332L_Lab_2_data"
#Although not descriptive, df is used in R and Python for data frames
# because it is short and will be used and reused often during data cleaning
df = pd.read_csv(filename, sep=',', engine='python')

In [None]:
#Peek at the data
df.head()

### FOCAL Dataset

The Fisheries Oceanography in Coastal Alabama (FOCAL) program is a long-term hydrographic monitoring program. It is designed to give baseline data on physical conditions (e.g., temperature, salinity, and dissolved oxygen). Today, we are looking at FOCAL temperature data over time. Because it's a long-term monitoring program, there is a lot of data available; we will be looking at how to manipulate the data to view it better.

**Note that "mab" in the column headings means "meters above the bottom."**

Let's take a look at columns that pertain to date and time.

Using the solution boxes below, answer the following questions:
1. What does yyyy represent?
2. What do mm, dd, hr, min, and sec represent?
3. What date and time is represented in row 0? Write this in the form mm/dd/yyyy hr:min:sec


Now that we understand the date and time columns, let's check out how often the data is logged.

1. What is the timestep for the data? 
Hint: difference in time between each row
2. How many temperature measures are taken per timestep?

### Cleaning Up Messy Data

Time series are rarely nice, clean data sets. Missing values are common and can be the result of a malfunctioning sonde (data collector), biofouling on the sonde, or a sample simply not being collected. Pandas and NumPy mark missing values as "NaN", which stands for "Not a Number". A first step in cleaning up data is finding and removing these NaN values.

Let's import NumPy so that we can do some simple data manipulation and calculations.

In [None]:
#Import the NumPy package and call it 'np'
import numpy as np

First, let's look for NaNs visually. Open up the datafile on your machine and scroll through for a second.

Which column(s) contain(s) NaNs?

But this is a coding assignment. Let's look for them in the 1mab column.



In [None]:
#First let's store the 1mab column in its own variable so we can refer to it more easily. Note: variable names cannot start with numbers.
#Select the column from the data frame by its name.

mab1 = df['1mab']

Now let's find where the nans are using Python!

The np.isnan() function will categorize any nans as "True".

In [None]:
#The syntax is np.isnan(variable).

nans1 = np.isnan(mab1) 

#### Count NaNs
Now, how many cells are NaNs? Because `True` counts as 1 and `False` counts as 0, we can count the NaNs by counting the nonzero values in `nans1` with `np.count_nonzero()`.
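
Before running this on the real column, here is a minimal sketch of the same two steps on a tiny made-up series (the `toy` names and values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Made-up temperatures with two missing readings
toy = pd.Series([np.nan, 14.2, np.nan, 15.1, 15.3])

toy_nans = np.isnan(toy)            # True where the value is NaN
print(toy_nans.tolist())            # [True, False, True, False, False]
print(np.count_nonzero(toy_nans))   # 2, because True counts as 1 and False as 0
```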

In [None]:
# Count how many NaNs, use the np.isnan(mab1) as your variable
np.count_nonzero(nans1)

#### Save a new variable without NaNs 
Now let's save a new variable, `new_mab1`, that is `mab1` without NaNs. We can do this by "applying" a "NaN mask" to `mab1`.

Think back to assignment 3. We made a mask to filter the data to find the temperature of the perfect pool. We defined the perfect pool as one where the temperature exceeded 31 (`temp_mask = temp_surf > 31`). The mask was an array of boolean values (True or False).

When we applied that mask, which data did python keep in the new array? 

Let's consider the output of isnan. Which values are true and which are false?

We want to get rid of the cells in columns that correspond to a NaN location in 'mab1'. This becomes crucial in plotting, when the dimensions of each column must agree with each other. 

We will use 'np.isnan(mab1)' as a mask, but we want the opposite of the default output: we want to mark the NaNs as 'False'. To do this, all you need to do is add a '~' in front of our 'nans'. The ~ will mark all NaNs as 'False'.
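
To see what `~` does before applying it to the real data, the sketch below flips a mask on a small made-up series. The `toy` names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.Series([np.nan, 14.2, np.nan, 15.1])   # made-up values

toy_nans = np.isnan(toy)    # True marks a NaN
toy_mask = ~toy_nans        # ~ flips every True/False, so True now marks a real number
toy_clean = toy[toy_mask]   # keeps only the rows where the mask is True

print(toy_mask.tolist())          # [False, True, False, True]
print(toy_clean.tolist())         # [14.2, 15.1]
print(toy_clean.index.tolist())   # [1, 3]: the original row labels are kept
```

Note that the surviving rows keep their original index labels, which matters later when the same mask is applied to other columns.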

In [None]:
#Create a mask
nan_mask= ~nans1

#Check that we've set up our mask correctly by adding code below.


In [None]:
#Remember the line of code to apply a mask from Assignment 3: warm_data = temp_surf[temp_mask]
#Let's apply the NaN mask to mab1

new_mab1 = mab1[nan_mask]

Let's double check that the NaNs are deleted from `new_mab1` by using the `.size` attribute, which gives the number of elements.

In [None]:
# How many elements in new_mab1?
new_mab1.size

In [None]:
# How many elements in mab1? Write and execute the code.


### Now do it again!

mab1 isn't the only column where we added NaNs! Look into the data 6 meters above the bottom and do the following:

Create a data frame for the data, count the nans, create a mask, and remove the nans from the column. 

Make a comment and answer: How many elements were in the column before and after you removed the nans?

Remember you can add more cells by using the plus button or entering command mode (esc, then A or B depending if you want the cell above or below).

# Date Formatting

#### Combine all date and time variables
Now that we've cleaned up the data, we need to create a datetimestamp so that we can graph everything.

A datetimestamp simply means a variable that includes the date and time in a single column. Right now, we have our year, month, day, hour, minutes, and seconds in all separate columns. We need to combine those. 

We will use the pandas function `pd.to_datetime` to do this.

But first, we need to create a new dataframe with only the date and time variables. 

In [None]:
#Create new dataframe with all the date and time variables
df_date = df[['yyyy', 'mm', 'dd', 'hr', 'min','sec']]

pd.to_datetime also requires that the variables are named year, month, etc.
We need to change the names of our variables from yyyy to year, mm to month, etc.
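
The renaming and conversion steps can be tried on two made-up rows first. This is a sketch only; the values in `toy_dates` are invented, though the column names mirror the lab's file.

```python
import pandas as pd

# Made-up rows using the same column names as the lab's data file
toy_dates = pd.DataFrame({
    "yyyy": [2015, 2015],
    "mm": [1, 1],
    "dd": [1, 1],
    "hr": [0, 1],
    "min": [0, 30],
    "sec": [0, 0],
})

# Rename to the names pd.to_datetime looks for
toy_dates.columns = ["year", "month", "day", "hour", "minute", "second"]

stamps = pd.to_datetime(toy_dates)   # one timestamp per row
print(stamps[1])                     # 2015-01-01 01:30:00
```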

In [None]:
#Convert column names to the names that the pandas 'to_datetime' function expects

df_date.columns=["year","month","day","hour","minute","second"]

In [None]:
#Now we can run pd.to_datetime to convert our date and time to a datetimestamp
dates = pd.to_datetime(df_date)

#and insert it into the first column in the dataframe.

df.insert(0, 'Date', dates)

In [None]:
#Check out our original dataframe. Is the new Date variable added?
df.head()

Would you look at that! However, we now don't need the six columns following it, so let's delete those data from the dataframe.

We will use the function `df.drop()`. For this, we need to specify which columns to drop by name. Delete those columns and take another peek at the data.

In [None]:
df=df.drop(columns=['yyyy', 'mm', 'dd','hr','min','sec'])
df.head()

Much better! Now let's plot some of the data.

# Plotting the Data

#### Import matplotlib
Let's import `matplotlib` so that we can do some graphing. 

In [None]:
#import matplotlib, call it plt
import matplotlib.pyplot as plt

#### Plot Date versus temperature at 1 meter above bottom
Now let's start with a simple plot of temperature over time 1 meter above the bottom.

Remember, we want to use the new_mab1 variable that has NaNs deleted, so that we don't graph the NaNs.
Our x (datetimestamp) is the newly created `df["Date"]`, and our y (temperature at 1 mab) is `new_mab1`.

Try the code below to plot `df["Date"]` versus `new_mab1`.

Does it work?

In [None]:
# Plot Date versus new_mab1
plt.plot(df['Date'], new_mab1)

#### Reading error messages

When trying to decipher error messages, it is often best to start at the end of the message itself.

Why didn't the plot code work? Copy and paste the reason for the error as a comment in the next cell

#### The dimensions of x and y were not the same!

Oh no! We deleted 39 nans from `mab1` to create `new_mab1`, but we did not delete those same cells from `df["Date"]`! 

That means there are extra cells in `df["Date"]` that do not correspond to a temperature value in `new_mab1`, and thus we cannot plot these two variables together. 

We need to get rid of the cells in `df["Date"]` that corresponded to the `mab1` NaN values! But no worries...that is simple enough. We'll just apply that "NaN mask" to the date column.
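
The length mismatch, and how the shared mask fixes it, can be seen on a small made-up frame. The `toy` names, dates, and values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: four dates, and the first two temperatures are missing
toy = pd.DataFrame({
    "Date": pd.date_range("2015-01-01", periods=4, freq="D"),
    "1mab": [np.nan, np.nan, 14.2, 15.1],
})

toy_mask = ~np.isnan(toy["1mab"])
toy_temp = toy["1mab"][toy_mask]

print(len(toy["Date"]), len(toy_temp))   # 4 2 -> lengths differ, so they cannot be plotted together

toy_date = toy["Date"][toy_mask]         # apply the same mask to the dates
print(len(toy_date), len(toy_temp))      # 2 2 -> lengths match
print(toy_date.index.tolist())           # [2, 3]
```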

In [None]:
# Delete the Dates that corresponded to the nan values in mab1 by applying the "NaN mask", aka [~np.isnan(mab1)]
# We will call the modifed Date variable "newdate_mab1"

newdate_mab1 = df["Date"][nan_mask]

In [None]:
# Let's check out newdate_mab1
print(newdate_mab1)

#### Now x and y dimensions match
Excellent! Now our modified Date column starts at index 39 (remember Python indexing begins at 0), which signifies that we deleted the first 39 Date cells (indices 0 through 38) that corresponded to the NaNs in `mab1`. The dimensions of `newdate_mab1` and `new_mab1` match now. We can plot!

In [None]:
# Plot newdate_mab1 versus new_mab1

plt.plot(newdate_mab1, new_mab1)

#### Modify x ticks 
Ok, now we've got a great first pass at a graph that shows temperature on the y axis and date on the x axis.

But the x axis shows tick marks for every other month. We want to see tick marks for every month. So let's change that.

#### Import dates module
First we want to import the `dates` module from `matplotlib` so that we can manipulate the dates x axis

In [None]:
# import dates module, name it mdates
import matplotlib.dates as mdates

Now we can manipulate the date axis. We're going to do the next four actions in one cell below.

#### Create a plot grid

Let's use the `plt.subplots()` command, which creates a figure and a grid of axes where you can make multiple plots. We will just be making one plot, but this function gives us control of the axes on that plot.

#### Use Month Locator
Next, let's use `MonthLocator` from the `dates` module to pinpoint each month. `MonthLocator` has an interval argument that allows you to tell it which months you want pinpointed. For us, that interval is 1. But say you wanted the axis to be marked every half year; then your interval would be 6.
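
As a sketch of the half-year case, the example below sets `interval=6` on two years of made-up monthly values. The `toy_` data are assumptions for illustration; the real lab data are plotted in the cell further down.

```python
import datetime as dt

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# Made-up data: one point on the first of each month for two years
toy_dates = [dt.datetime(2015 + m // 12, m % 12 + 1, 1) for m in range(24)]
toy_temps = list(range(24))

fig, ax = plt.subplots()
half_year_locator = mdates.MonthLocator(interval=6)   # a tick every sixth month
ax.xaxis.set_major_locator(half_year_locator)
ax.plot(toy_dates, toy_temps)
```

Changing `interval` back to 1 gives a tick for every month, which is what the lab uses.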

#### Set the major axis
Now use `ax.xaxis.set_major_locator()` to set the major ticks. The major ticks are the labeled tick marks. You could also add minor ticks, which are unlabeled by default.

#### Plot using ax.plot
Now let's plot using `ax.plot` instead of `plt.plot`. Using `ax.plot` allows us to plot with the axis modifications we just made. 

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled
ax.plot(newdate_mab1,  new_mab1) # now plot!

#### Reformat date labels
Now we have our x-ticks every month, but it's just giving us the year as the label. So let's change the format of the label. 
We can do that by creating a formatter with `mdates.DateFormatter()` and passing it to `ax.xaxis.set_major_formatter()`. Add both to the code from above to make a new plot.
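
The format string given to `mdates.DateFormatter` uses the same codes as Python's `strftime`, so you can preview a format on a single timestamp before putting it on an axis. The timestamp below is made up for illustration.

```python
import datetime as dt

stamp = dt.datetime(2015, 3, 9, 14, 30)   # made-up timestamp

print(stamp.strftime("%Y-%m"))      # 2015-03
print(stamp.strftime("%m/%d/%Y"))   # 03/09/2015
print(stamp.strftime("%b %Y"))      # abbreviated month name and year; the name depends on your locale
```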

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled

ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only

ax.plot(newdate_mab1,  new_mab1) # now plot!

#### Rotate x tick labels
Ok we have our monthly date labels! But...they are stacked on top of each other and can't be read! But this is an easy fix. Let's use `fig.autofmt_xdate()` to rotate the date labels.

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only
ax.plot(newdate_mab1,  new_mab1) # now plot!

fig.autofmt_xdate() #rotate date labels

#### What are the seasonal temperature patterns?
Ok! Now let's take a closer look at our plot.

Which month(s) has the highest temperature values? Comment your answer below. 

And which season has the lowest temperature values? Comment your answer below. 

#### Label the y-axis
Sounds good! But let's add a y-axis label so that anyone looking at this plot can see that we are graphing temperature. We can use `ax.set_ylabel()` to do this.

#### Add a title to the graph
Let's also add a title to the graph with `ax.set_title()`.

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only
ax.plot(newdate_mab1,  new_mab1) # now run!
fig.autofmt_xdate() #rotate date labels

ax.set_ylabel('Temperature (°C)') # label y axis

ax.set_title('Temperature at Focal') # create a title

#### Add surface temperature to the graph
Awesome. Ok now that we have everything set up, let's add some more data to the graph! 

We have temperature at 20 depths in the water column. We've graphed `new_mab1`, which represents the temperature 1 meter above the bottom, without NaNs. 

Let's add the surface temperature. Which variable would that be? Answer below with a comment.

#### What is our x variable?
And what is the x variable, or date variable, we should be plotting surface temperature against? Answer below with a comment.

#### Let's add surface temp
Great. Now let's add the surface temperature to the graph. For our data, that's the 20mab column.

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only
ax.plot(newdate_mab1,  new_mab1) # plot temp 1 meter above bottom

ax.plot(df["Date"], df["20mab"]) # plot temperature 20 meters above bottom

fig.autofmt_xdate() #rotate date labels
ax.set_ylabel('Temperature (°C)') # label y axis
ax.set_title('Temperature at Focal') # create a title

#### Add a legend
Looks good! Now let's add a legend so that we know which temperature is at the surface and which is at the bottom. 

We can do that by adding a label to the plot function, `ax.plot(newdate_mab1,  new_mab1, label = "1 meter above bottom")`, and then adding `ax.legend()`.
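
A minimal sketch of how labels feed the legend, using made-up values and hypothetical label names:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [14, 15, 16], label="bottom")    # made-up values
ax.plot([0, 1, 2], [16, 18, 20], label="surface")   # made-up values
ax.plot([0, 1, 2], [15, 15, 15])                    # no label, so no legend entry
legend = ax.legend()

print([text.get_text() for text in legend.get_texts()])   # ['bottom', 'surface']
```

Only lines that were given a `label` appear in the legend, so every line you want listed needs its own label.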

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only

ax.plot(newdate_mab1,  new_mab1, label = "1mab") # plot temp 1 meter above bottom

ax.plot(df["Date"], df["20mab"], label = "20mab") # plot temperature 20 meters above bottom

fig.autofmt_xdate() #rotate date labels
ax.set_ylabel('Temperature (°C)') # label y axis
ax.set_title('Temperature at Focal') # create a title

ax.legend()

#### Compare surface and bottom temperature
Great! When is surface water temp higher than bottom temp?

When is surface temperature lower than bottom temp?

#### Save plot as pdf
Now let's save the plot we made as a pdf with `plt.savefig('name.pdf')`. This saves your plot as a pdf in your Jupyter Notebook directory.
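
Here is a small sketch of saving a figure, under a few assumptions: the plotted values and file names are made up, and the files go to a temporary folder rather than the notebook directory so nothing you have saved is overwritten.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [14, 15, 16])   # made-up values

# A temporary folder keeps this sketch from cluttering the notebook directory
out_dir = tempfile.mkdtemp()
pdf_path = os.path.join(out_dir, "toy_plot.pdf")
svg_path = os.path.join(out_dir, "toy_plot.svg")

fig.savefig(pdf_path)   # the file type is taken from the extension
fig.savefig(svg_path)   # same figure, different format

print(os.path.exists(pdf_path), os.path.exists(svg_path))   # True True
```

`fig.savefig()` saves that specific figure, while `plt.savefig()` saves whichever figure is currently active; in a single-figure cell like the one below, the two do the same thing.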

In [None]:
fig, ax = plt.subplots() # create subplots to be able to manipulate axes
monthly_locator = mdates.MonthLocator(interval=1) # set your month locator interval to 1 to get monthly tick marks
year_month_formatter = mdates.DateFormatter("%Y-%m") # four digits for year, two for month
ax.xaxis.set_major_locator(monthly_locator) # set your tick marks to the Major axis, aka, they will be labeled
ax.xaxis.set_major_formatter(year_month_formatter) # formatter for major axis only
ax.plot(newdate_mab1,  new_mab1, label = "1mab") # plot temp 1 meter above bottom
ax.plot(df["Date"], df["20mab"], label = "20mab") # plot temperature 20 meters above bottom
fig.autofmt_xdate() #rotate date labels
ax.set_ylabel('Temperature (°C)') # label y axis
ax.set_title('Temperature at Focal') # create a title
ax.legend()

plt.savefig('temp1.pdf') #save plot as temp1.pdf in Jupyter Notebook directory

#### Save plot as png
You can save in other file formats too! Rerun the figure code and save it as a .png instead of a .pdf.

In [None]:
#Copy and paste figure code below 
#and modify the line of code to save as a .png using the command we gave you for saving it as a pdf


# Add temperature at other depths
#### Your turn: Add two more sets of temperature to the graph
Ok, we are almost done! For the final exercise, choose two temperature columns that are between 1 meter above the bottom and 20 meters above the bottom and add those to the graph.

To do this, all you have to do is add this line to the code for each column you choose:
`ax.plot(df["Date"], df["column name"], label = "column name")`. 

Take a look at how the code for the plot changed when we added the surface temperature, if you have problems. Remember, copy and paste are your best friends. 



**YOU WILL NEED TO CHANGE THE FILE NAME IN `plt.savefig()`.**

Otherwise, you will write over the earlier plots you saved! Rename the .png file so that you can also turn in a png of this plot.

In [None]:
# Copy and paste the figure code here, and add two new temperature columns you've chosen to graph
# Remember to comment your code! Add new comments to whatever you change

#### Compare temperature at various depths
How does the temperature of the variable you chose compare to the surface and bottom temperature? Comment your answer below. 

#### Now change the title 

Now change the title command to something else. Rerun the figure with the new title. 

In [None]:
#Rerun figure with new title, save it as a png


### Depth Profiles

Now let's plot some vertical profiles of the temperature data.

First, we need to modify the data a bit to get it in a usable format.

In [None]:
#Let's look at the data again.
df.head()

In [None]:
#Make a copy of the df and call it df_mab
#df.copy() gives an independent copy, so changes to df_mab do not affect df
df_mab = df.copy()

Look at the first column. That's the row number stored in the dataframe. We don't want that information to make a vertical plot, so let's drop it.

In [None]:
#Set the index before transposing, or else 'row number' will be at the top
#Overwrite df_mab with the re-indexed version

df_mab=df_mab.set_index('Date')
df_mab.head()

Let's transpose the data. This will put the mabs as the rows and the datetime as the columns. 
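
On a tiny made-up frame, the transpose looks like this. The `toy` frame, its column names, and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up frame: two timestamps (rows) by three heights above bottom (columns)
toy = pd.DataFrame(
    {"0mab": [14.0, 14.5], "1mab": [15.0, 15.5], "2mab": [16.0, 16.5]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02"]),
)

toy_t = toy.transpose()   # rows become columns and columns become rows

print(toy.shape, toy_t.shape)   # (2, 3) (3, 2)
print(toy_t.index.tolist())     # ['0mab', '1mab', '2mab']
print(toy_t[pd.Timestamp("2015-01-01")].tolist())   # [14.0, 15.0, 16.0]
```

After the transpose, selecting a single timestamp returns one value per height, which is the shape a vertical profile needs.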

In [None]:
#Transpose, rename
df_t = df_mab.transpose()
df_t.head()

We have the mab information, but we need a column with the meters off the bottom data as an integer and not a string.

In [None]:
#Create a new column holding the mab information.
#np.arange will give us an array of integers.
mabs = np.arange(0,21)
mabs

In [None]:
#And now insert the array to the dataframe at the beginning (index=0)
df_t.insert(0, 'mab(m)', mabs)
df_t.head()

We now want to plot the temperature (x) versus depth (y). First, select a given date using its row number in the original `df`. We use the index 3000, which corresponds to the timestamp shown below.

In [None]:
somedate = df['Date'][3000]
somedate

In [None]:
#Vertical plot
plt.figure(figsize=[5,8]) #you can set the figure size like so
plt.plot(df_t[somedate], df_t['mab(m)']) #plot the temperature data from somedate using the transposed dataframe.
plt.show()

#Take some time to add axes labels and a title too



#### Using a loop to add multiple depth profiles to the same plot

I want to make the same plot as above, but showing data from multiple dates. We have 8722 days' worth of data; let's take only some of those.

In [None]:
#Start by selecting some number of dates from the dataframe
#Set these as the number of iterations we will run through
iterations = df_t.shape[1]//365 #shape[1] is the number of columns in df_t; integer division by 365 sets how many profiles we plot
iterations

Next, we will use a 'for-loop' to add each timestamp's data to the same figure.

A for-loop repeats a block of code a specified number of times. In our case, `iterations` gives the number of dates we want to plot (24 here). Because `range()` starts counting at 0, the loop starts at i=0 and stops after i=iterations-1. For every i, it looks up a date in the original dataframe, just like we did above, and then plots that date's data from the transposed dataframe. Each pass through the loop adds another depth profile to the plot, so we should end up with one depth profile per iteration.
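
To see exactly which values `i` takes and which row numbers a step of 365 produces, here is a short sketch. The count of 4 is made up to keep the output short; the lab computes its own count in the cell above.

```python
toy_iterations = 4   # made-up count for illustration
step = 365

row_numbers = []
for i in range(toy_iterations):
    row_numbers.append(i * step)   # row number used on this pass through the loop
    print(i, i * step)

# prints:
# 0 0
# 1 365
# 2 730
# 3 1095
```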

In [None]:
plt.figure(figsize=[10,10]) #Create a figure and set the figure size as 10inx10in

for i in range(iterations):
    somedate = df['Date'][i*365] #take every 365th row
    plt.plot(df_t[somedate], df_t['mab(m)'])
    
#Add some lines of code below to add a figure title, axes titles, and a legend, as you did above.
#Add some code to save this figure too!


# We're done! 

#### You've learned how to clean up data by removing NaNs, how to reformat date and time columns into a datetimestamp, graph multiple lines on a single plot, and make vertical plots.

These are key fundamentals of being a physical and geological oceanographer!

#### REMEMBER to download and upload this assignment to Canvas. Please download both a .html and .ipynb version of the assignment and upload to Canvas. Also upload the five plots you saved. 


Download as>HTML(.html)

Download as>Notebook(.ipynb)