## Ocean Biogeochemical Dynamics Lab, Spring 2021
### Introduction to Python and Jupyter Notebooks
### Importing and cleaning data

In [None]:
# This is how you make a comment!
# Always annotate your code so you know what you did and why
# I promise you won't remember the details when it comes time to write up the results

# Ideally, you are also using some kind of version control, such as Git with GitHub.
# Version control allows you to track changes in your code, revert back if you need to,
# or even branch a piece of code off into two independent versions.
# It is also great for collaborative projects.

# Always start by importing the tools you will need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.path as mpath
import seaborn as sns # this will change the look of pandas plots too
import cartopy.crs as ccrs
import cartopy.feature
%matplotlib inline 
# this forces matplotlib to print figures out here when you make plots

# Press Shift + Enter to "run" this cell and move on to the next one
# Press Option + Enter (Alt + Enter on Windows/Linux) to run this cell and add an empty cell below it

In [None]:
# Start a new "cell" when you transition to a new step in your code
# This allows you to run your code in chunks and better troubleshoot where issues might be.

# Bring in the dataset you will need
# This code imports a SOCCOM biogeochemical Argo Float "snapshot" dataset from December 2020 for one float.
# The entire .zip file with all floats can be downloaded here: 
# https://library.ucsd.edu/dc/object/bb94601812 as the "LIAR High resolution ODV format"
# It is easiest to place the file in the same directory where you will keep this code

# Download, and then go and unzip that file. What's inside?


# In this case, a "snapshot" means that the data have been quality-controlled 
# and archived with a "doi" or digital object identifier.
# It is important to use a dataset that has been "frozen" somewhere and to document 
# every step you take from that point on, including cleaning, reformatting, renaming, unit conversions,
# and calculations.

# We want to use Pandas' built-in read_csv function to import the data file into a pandas
# data frame called "flt"

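# on_bad_lines='skip' skips rows with too many fields (pandas >= 1.3; older versions used error_bad_lines=False)
flt=pd.read_csv('SOCCOM_HiResQC_LIAR_22Dec2020_odvtxt/9091SOOCN_HRQC.TXT', on_bad_lines='skip')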
# There are a bunch of other input options for this function, and you can see them by
# pressing "tab" inside the function



In [None]:
# Python doesn't typically spit out the results unless you ask for them.
# This is how you look at just the "head" (the first five rows) of the flt dataframe
flt.head()

In [None]:
# This is how you look at the whole flt dataframe
flt

In [None]:
# Clearly something is wrong. We didn't get any meaningful data! Why? Because those are comment lines in the data file.
# Let's try telling read_csv what a comment looks like
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False)
flt=pd.read_csv('SOCCOM_HiResQC_LIAR_22Dec2020_odvtxt/9091SOOCN_HRQC.TXT', on_bad_lines='skip', comment='/')
# There are a bunch of other input options for this function, and you can see them by
# typing a comma and then pressing "tab" from inside the function parentheses
# Run this new read_csv function and look at the header of the new flt dataframe
# by running this code you overwrite your last flt dataframe
flt.head()

In [None]:
# We are getting warmer, we now see a more meaningful header and some data, but what are those "\t"s?

# Those are tab delimiters/separators. CSV means comma separated values and TSV means tab separated values.
# The two file types typically look identical, and the delimiters are invisible when viewed in Excel or a text editor.
# So now we need to tell read_csv what the delimiter is.
# (on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False)
flt=pd.read_csv('SOCCOM_HiResQC_LIAR_22Dec2020_odvtxt/9091SOOCN_HRQC.TXT', on_bad_lines='skip', 
                comment='/', delimiter='\t')
# simply press enter to continue your code onto a new line if you're inside parentheses
# Look at the header
flt.head()

In [None]:
# That looks better, now let's look at the info for the file to see more:
flt.info()

In [None]:
# We are getting many rows of data, but only four columns. Why? 

# Because the "comment" character used in these float data files is two forward slashes "//"
# Unfortunately, Pandas read_csv can only handle one character in the comment field.

# Because we entered '/' as the comment character, we also lose everything after any '/'
# In this case, we lose everything after "mon" in the header row and after the month number in all data rows
# Basically, your comment character (like your delimiter/separator character) should never appear anywhere else in your data file.
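
The truncation described above can be reproduced on a few made-up lines, without the real float file. This is only a sketch: the text in `raw` is invented for illustration and is much simpler than an actual SOCCOM file.

```python
import io
import pandas as pd

# Made-up miniature of a float file: a '//' comment line, tab-separated data,
# and a header name that itself contains '/' characters
raw = (
    "//File created for illustration only\n"
    "Station\tmon/day/yr\tPressure[dbar]\n"
    "1\t12/22/2020\t5.0\n"
    "2\t12/23/2020\t10.0\n"
)

# comment='/' cuts every line at its first '/', including the header and data rows
truncated = pd.read_csv(io.StringIO(raw), delimiter='\t', comment='/')
print(truncated.columns.tolist())

# After swapping '//' for '#', the single slashes inside the data are left alone
cleaned = pd.read_csv(io.StringIO(raw.replace('//', '#')), delimiter='\t', comment='#')
print(cleaned.columns.tolist())
```

Comparing the two printed column lists shows where the header was cut off in the first read.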

In [None]:
# To work around this, we will first open the file and replace all instances of '//' with '#'
# I checked to make sure "#" isn't used in the actual data anywhere

# input file, and the output file to which we will write the result
infile = 'SOCCOM_HiResQC_LIAR_22Dec2020_odvtxt/9091SOOCN_HRQC.TXT'
outfile = 'fltrem.txt'

# This is how you write a for loop.
# A loop repeats the same steps for every item, here every line of the input file.
# The "with" statement closes both files automatically, even if an error occurs.
with open(infile,'rt',encoding='UTF-8') as fin, open(outfile,'wt',encoding='UTF-8') as fout:
    # for each line in the input file
    for line in fin:
        # read the line, replace the string, and write it to the output file
        fout.write(line.replace('//','#'))
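
If you need the same cleanup for more than one float, the loop can be wrapped in a function. The sketch below is one possible way to do that. The name `replace_comment_marker` and the demo file names are hypothetical, and the example runs on a tiny made-up file in a temporary folder rather than on the real data.

```python
import tempfile
from pathlib import Path

def replace_comment_marker(in_path, out_path, old='//', new='#', encoding='UTF-8'):
    """Copy in_path to out_path, replacing `old` with `new` on every line."""
    with open(in_path, 'rt', encoding=encoding) as fin, \
         open(out_path, 'wt', encoding=encoding) as fout:
        for line in fin:
            fout.write(line.replace(old, new))

# Try it on a tiny made-up file so the example runs without the real data
tmpdir = Path(tempfile.mkdtemp())
demo_in = tmpdir / 'demo_in.txt'
demo_out = tmpdir / 'demo_out.txt'
demo_in.write_text('//a comment line\nStation\tmon/day/yr\n1\t12/22/2020\n', encoding='UTF-8')

replace_comment_marker(demo_in, demo_out)
result = demo_out.read_text(encoding='UTF-8')
print(result)
```

Only the doubled slashes change; the single slashes in the date column are untouched.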

In [None]:
flt=pd.read_csv('fltrem.txt',delimiter='\t',comment='#',na_values=-1E10)
# Now I've also added an argument to tell read_csv what the "not available" (NA) value is in the file
# NA values are used to fill in where there is either no data or sometimes bad data which has been removed
# They used a very large negative number (-1E10) that you would never get from a sensor.
# You could also use an absurdly large positive number. read_csv turns these values into NaN ("not a number")
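
To see why the fill value matters, here is a small sketch on made-up numbers (the three oxygen values are invented). Without `na_values`, the fill value is treated as a real measurement and wrecks any statistic you compute.

```python
import io
import pandas as pd

# Made-up data: the second oxygen value is the fill value, not a measurement
raw = "Pressure[dbar]\tOxygen[µmol/kg]\n5.0\t310.2\n10.0\t-1E10\n15.0\t305.7\n"

with_fill = pd.read_csv(io.StringIO(raw), delimiter='\t')
with_nan = pd.read_csv(io.StringIO(raw), delimiter='\t', na_values=-1E10)

# The mean is dominated by the fill value unless it is read as NaN
print(with_fill['Oxygen[µmol/kg]'].mean())
print(with_nan['Oxygen[µmol/kg]'].mean())
```

pandas skips NaN values by default when computing means, so the second result uses only the two real measurements.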

In [None]:
#take a look at the info for the flt dataframe you have made
flt.info()

In [None]:
# look at the head of the data frame
flt.head()

In [None]:
# Notice that the date and time are stored as text strings; we want them as a single datetime column
flt['date']=pd.to_datetime(flt['mon/day/yr']+' '+ flt['hh:mm'])
flt.info()
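
The conversion can be tried on two made-up rows. This sketch assumes the strings look like `12/22/2020` and `06:30`; the explicit `format` argument is an addition that is not in the cell above, and it would need to change if the real file is formatted differently.

```python
import pandas as pd

# Made-up rows using the same column names as the float file
demo = pd.DataFrame({
    'mon/day/yr': ['12/22/2020', '01/01/2021'],
    'hh:mm': ['06:30', '18:05'],
})

# Join the two text columns, then parse them with an explicit (assumed) format
demo['date'] = pd.to_datetime(demo['mon/day/yr'] + ' ' + demo['hh:mm'],
                              format='%m/%d/%Y %H:%M')

# Unlike text, datetimes support arithmetic and sorting
elapsed = demo['date'].iloc[1] - demo['date'].iloc[0]
print(demo.dtypes)
print(elapsed)
```

Stating the format avoids any ambiguity between month-first and day-first dates.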

In [None]:
# Notice the QF or Quality Flag columns. These tell us which data are good, questionable, and bad.
# We only want to use good data. How can we tell which data are good?
# Go back to the comments at the top of the text file, which define the flag values.
# We will keep only the data flagged as good and remove the questionable and bad data.
# We will do this later.

In [None]:
# Make a map using cartopy (basemap is deprecated!)

plt.figure(figsize=(6, 6))
ax = plt.axes(projection=ccrs.SouthPolarStereo())
ax.set_extent([-180,180,-90,-30],ccrs.PlateCarree())
ax.add_feature(cartopy.feature.LAND)
ax.add_feature(cartopy.feature.OCEAN)
ax.gridlines()

# Compute a circle in axes coordinates, which we can use as a boundary
# for the map. We can pan/zoom as much as we like - the boundary will be
# permanently circular.
theta = np.linspace(0, 2*np.pi, 100)
center, radius = [0.5, 0.5], 0.5
verts = np.vstack([np.sin(theta), np.cos(theta)]).T
circle = mpath.Path(verts * radius + center)

ax.set_boundary(circle, transform=ax.transAxes)

plt.plot(flt['Lon [°E]'],flt['Lat [°N]'],color='Black',transform=ccrs.PlateCarree())

plt.savefig('SPstereo.pdf')
plt.savefig('SPstereo.png')

plt.show()

In [None]:
# Let's make a quick plot of temperature vs pressure
plt.plot(flt['Temperature[°C]'],flt['Pressure[dbar]'])
# This method is quick and dirty but doesn't give us much control over the figure

In [None]:
# Something funny? 
# We want to invert the axis and add some labels

# Now we will use matplotlib's object-oriented interface to have more control over the plot
fig = plt.figure()
# this allows you to create multiple axes
axes1= fig.add_axes([0, 0, 1, 1])
axes1.plot(flt['Temperature[°C]'],flt['Pressure[dbar]'])
axes1.set_title('Float 9091 Temperature') # There are ways to make the figure title dynamic,
# i.e. adapts to the float number
axes1.invert_yaxis()
axes1.set_xlabel('Temperature [°C]')
axes1.set_ylabel('Pressure [dbar]')
# if you wanted to add a second, smaller set of axes (an inset) you would add it like this
#axes2= fig.add_axes([.7, .7, .2, .2])

In [None]:
# Can also use subplots function
fig,axes = plt.subplots(nrows = 1, ncols = 2,figsize=(6,6))
# if you have many subplots and some overlap, use tight_layout, or you can leave it 
# at the end of all of your plot statements
# plt.tight_layout()

axes[0].plot(flt['Temperature[°C]'],flt['Pressure[dbar]'])
#axes[0].set_title('Temperature')
axes[0].invert_yaxis()
axes[0].set_ylabel('Pressure [dbar]')
axes[0].set_xlabel('Temperature [°C]')

axes[1].plot(flt['Salinity[pss]'],flt['Pressure[dbar]'])
#axes[1].set_title('Salinity')
axes[1].invert_yaxis()
axes[1].set_ylabel('Pressure [dbar]')
axes[1].set_xlabel('Salinity [pss]')

plt.tight_layout()
# Here we save the figure. We have given it a name, a type, and a dpi or
# dots per inch, which sets the resolution
fig.savefig('F9091TS.png', dpi = 200)

In [None]:
# Now let's plot multiple things on one axis
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, .8, .8])
ax.scatter(flt['DIC_LIAR[µmol/kg]'],flt['Pressure[dbar]'],label = 'DIC(umol kg-1)', color = 'red')
ax.scatter(flt['TALK_LIAR[µmol/kg]'],flt['Pressure[dbar]'],label = 'TALK(umol kg-1)', color = 'blue') 
# for color you can also put in an RGB hex code
ax.legend(loc=0) # 0 is for the "best" location
ax.set_title('Float 9091')# Figure out how to have this be dynamic and change with the float number
ax.invert_yaxis()
fig.savefig('F9091DICTALK.png', dpi = 200)
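
One possible way to make the title and file name follow the float number is to pull the number out of the data file name and build the strings with f-strings. This is a sketch under one assumption: that the float ID is the leading run of digits in the file name, as it is for `9091SOOCN_HRQC.TXT`. The variable names are hypothetical.

```python
import re
from pathlib import Path

data_path = 'SOCCOM_HiResQC_LIAR_22Dec2020_odvtxt/9091SOOCN_HRQC.TXT'

# Assumption: the float ID is the leading run of digits in the file name
file_name = Path(data_path).name
float_id = re.match(r'\d+', file_name).group()

# f-strings insert the value of a variable into a string
title = f'Float {float_id}'
fig_name = f'F{float_id}DICTALK.png'
print(title)
print(fig_name)
```

You would then pass these to the plot, for example `ax.set_title(title)` and `fig.savefig(fig_name, dpi=200)`.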

In [None]:
# This is where we left off in class on Jan 22

# We see funny data here. Maybe we need to use the quality flags now
# Keep only good data flagged zero and replace rest with 'nan' which is "not a number"
flt['DIC_LIAR[µmol/kg]'] = np.where(flt['QF.17'] == 0 , flt['DIC_LIAR[µmol/kg]'], np.nan)

# Now replot that same figure
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, .8, .8])
ax.scatter(flt['DIC_LIAR[µmol/kg]'],flt['Pressure[dbar]'],label = 'DIC(umol kg-1)', color = 'red')
ax.scatter(flt['TALK_LIAR[µmol/kg]'],flt['Pressure[dbar]'],label = 'TALK(umol kg-1)', color = 'blue') 
# for color you can also put in an RGB hex code
ax.legend(loc=0) # 0 is for the "best" location
ax.set_title('Float 9091')# Figure out how to have this be dynamic and change with the float number
ax.invert_yaxis()
fig.savefig('F9091DICTALKQC.png', dpi = 200)
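
Looking up the right `QF.n` column by hand for every variable is easy to get wrong. The sketch below shows one alternative, assuming each variable column is immediately followed by its own quality flag column. The helper name `keep_good` and the demo values are made up, so check the column order in your own dataframe before relying on it.

```python
import numpy as np
import pandas as pd

def keep_good(df, var, good_flag=0):
    """Return df[var] with NaN wherever the next column (assumed to be its flag) is not good_flag."""
    flag_col = df.columns[df.columns.get_loc(var) + 1]
    if not str(flag_col).startswith('QF'):
        raise ValueError(f'Column after {var!r} is {flag_col!r}, not a QF column')
    # Series.where keeps values where the condition is True and inserts NaN elsewhere
    return df[var].where(df[flag_col] == good_flag)

# Made-up data laid out as variable, flag, variable, flag
demo = pd.DataFrame({
    'Pressure[dbar]': [5.0, 10.0, 15.0],
    'QF': [0, 0, 0],
    'Nitrate[µmol/kg]': [20.1, 99.9, 21.3],
    'QF.1': [0, 4, 0],
})
demo['Nitrate[µmol/kg]'] = keep_good(demo, 'Nitrate[µmol/kg]')
print(demo)
```

The check on the flag column name makes the helper fail loudly instead of silently masking with the wrong column.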

In [None]:
# If we use "plot" we get a HUGE gap in the mid-water column. 
# The BGC Data are lower resolution so we need to use scatter.
fig = plt.figure()
# this allows you to create multiple axes
axes1= fig.add_axes([0.1, 0.1, .8, .8])
axes1.scatter(flt['pHinsitu[Total]'],flt['Pressure[dbar]'],label = 'insitu pH',color='purple')
axes1.legend()
axes1.set_title('Float 9091')
axes1.invert_yaxis()
fig.savefig('F9091pH.png', dpi = 200)

In [None]:
# Keep only good data flagged zero and replace rest with 'nan' which is "not a number"
flt['pHinsitu[Total]'] = np.where(flt['QF.14'] == 0 , flt['pHinsitu[Total]'], np.nan)
fig = plt.figure()
# this allows you to create multiple axes
axes1= fig.add_axes([0.1, 0.1, .8, .8])
axes1.scatter(flt['pHinsitu[Total]'],flt['Pressure[dbar]'],label = 'insitu pH',color='purple')
axes1.legend()
axes1.set_title('Float 9091')
axes1.invert_yaxis()
fig.savefig('F9091pHQC.png', dpi = 200)

In [None]:
# is nitrate there?
fig = plt.figure()
# this allows you to create multiple axes
axes1= fig.add_axes([0.1, 0.1, .8, .8])
axes1.scatter(flt['Nitrate[µmol/kg]'],flt['Pressure[dbar]'],label = 'Nitrate (umol kg-1)',color='green')
axes1.legend()
axes1.set_title('Float 9091')
axes1.invert_yaxis()
fig.savefig('F9091Nitrate.png', dpi = 200)

In [None]:
# Keep only good data flagged zero and replace rest with 'nan' which is "not a number"
flt['Nitrate[µmol/kg]'] = np.where(flt['QF.8'] == 0 , flt['Nitrate[µmol/kg]'], np.nan)

fig = plt.figure()
# this allows you to create multiple axes
axes1= fig.add_axes([0.1, 0.1, .8, .8])
axes1.scatter(flt['Nitrate[µmol/kg]'],flt['Pressure[dbar]'],label = 'Nitrate (umol kg-1)',color='green')
axes1.legend()
axes1.set_title('Float 9091')
axes1.invert_yaxis()
fig.savefig('F9091NitrateQC.png', dpi = 200)

In [None]:
# is oxygen there?
# Keep only good data flagged zero and replace rest with 'nan' which is "not a number"
flt['Oxygen[µmol/kg]'] = np.where(flt['QF.6'] == 0 , flt['Oxygen[µmol/kg]'], np.nan)
fig = plt.figure()
# this allows you to create multiple axes
axes1= fig.add_axes([0.1, 0.1, .8, .8])
axes1.scatter(flt['Oxygen[µmol/kg]'],flt['Pressure[dbar]'],label = 'Oxygen (umol kg-1)',color='teal')
axes1.legend()
axes1.set_title('Float 9091')
axes1.invert_yaxis()
fig.savefig('F9091Oxygen.png', dpi = 200)

In [None]:
# Next look at oxygen over time
fig = plt.figure(num=None, figsize=(10, 2), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_axes([0.1, 0.1, .8, .8])
sc=ax.scatter(flt['date'],flt['Depth[m]'],c=flt['Oxygen[µmol/kg]'],cmap = 'magma',)
ax.invert_yaxis()
ax.set_ylim([300,0])
cb=plt.colorbar(sc)
cb.set_label('Oxygen[µmol/kg]')
sc.set_clim(vmin = 180, vmax = 380) 

In [None]:
# Next look at pCO2 over time
fig = plt.figure(num=None, figsize=(10, 2), dpi=80, facecolor='w', edgecolor='k')
ax = fig.add_axes([0.1, 0.1, .8, .8])
sc=ax.scatter(flt['date'],flt['Depth[m]'],c=flt['pCO2_LIAR[µatm]'],cmap = 'magma')
ax.invert_yaxis()
ax.set_ylim([300,0])
cb=plt.colorbar(sc)
cb.set_label('pCO2[µatm]')
sc.set_clim(vmin = 350, vmax = 500) 


In [None]:
# Keep only good data flagged zero and replace rest with 'nan' which is "not a number"
flt['Temperature[°C]'] = np.where(flt['QF.2'] == 0 , flt['Temperature[°C]'], np.nan)
flt['Salinity[pss]'] = np.where(flt['QF.3'] == 0 , flt['Salinity[pss]'], np.nan)

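# distplot is deprecated in newer versions of seaborn; histplot is its replacement
# kde=False (the Python boolean, not the string 'false') turns off the smoothed density curve
sns.histplot(flt['Temperature[°C]'],kde=False)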

In [None]:
sns.jointplot(x='Temperature[°C]',y='Salinity[pss]',data=flt,kind='hex')

In [None]:
fltsmall=flt[['pHinsitu[Total]','Salinity[pss]','Temperature[°C]']].copy()

In [None]:
fltsmall

In [None]:
sns.pairplot(fltsmall)
# This will take some time. Notice the "*" that appears at the upper left while the cell runs
# If something is taking longer to run than you think it should, that's called "hanging" and
# it may be due to an error. You can stop that cell by going up to "Kernel" in the menu bar and 
# clicking "Interrupt"

In [None]:
# Use this code to subsample the larger dataframe to be used with seaborn grid plots
# a=flt.pivot_table(index='Pressure[dbar]',columns='Station',values='Temperature[°C]')
# a
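
To see what that pivot table does before running it on the full dataframe, here is a sketch on four made-up rows. The values are invented; only the column names match the float file.

```python
import pandas as pd

# Made-up "long" data: one row per measurement
demo = pd.DataFrame({
    'Station': [1, 1, 2, 2],
    'Pressure[dbar]': [10.0, 20.0, 10.0, 20.0],
    'Temperature[°C]': [2.5, 2.1, 2.7, 2.2],
})

# Reshape to "wide" data: one row per pressure, one column per station
a = demo.pivot_table(index='Pressure[dbar]', columns='Station', values='Temperature[°C]')
print(a)
```

If a station has more than one value at the same pressure, `pivot_table` averages them by default.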

In [None]:
# Regression plots
# What is the relationship between TALK and S? Do you think that alkalinity can be estimated from just salinity?
sns.lmplot(x='Salinity[pss]',y='TALK_LIAR[µmol/kg]',data=flt)
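
`lmplot` draws the fit but does not report its numbers. One way to get a slope, intercept, and R² is `np.polyfit`, sketched below on synthetic data. The salinity range, slope, and noise level are invented for illustration and are not results from float 9091.

```python
import numpy as np

# Synthetic stand-in for salinity and alkalinity (all numbers are made up)
rng = np.random.default_rng(0)
salinity = rng.uniform(33.8, 34.8, size=200)
talk = 66.0 * salinity + 30.0 + rng.normal(0.0, 3.0, size=200)

# polyfit cannot handle NaNs, so keep only rows where both values are finite
good = np.isfinite(salinity) & np.isfinite(talk)
slope, intercept = np.polyfit(salinity[good], talk[good], deg=1)

# R-squared: the fraction of the variance in TALK explained by the straight line
predicted = slope * salinity[good] + intercept
ss_res = np.sum((talk[good] - predicted) ** 2)
ss_tot = np.sum((talk[good] - talk[good].mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(slope, intercept, r_squared)
```

With the real data you would pass `flt['Salinity[pss]'].to_numpy()` and `flt['TALK_LIAR[µmol/kg]'].to_numpy()` in place of the synthetic arrays.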

In [None]:
sns.lmplot(x='Salinity[pss]',y='DIC_LIAR[µmol/kg]',data=flt) #Seaborn linear model plot

In [None]:
# What are these two different blobs? Plot a third variable as a color
flt.plot.scatter(x='Salinity[pss]',y='DIC_LIAR[µmol/kg]',c='Pressure[dbar]',cmap='Purples')

In [None]:
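# numeric_only=True averages only the numeric columns; newer versions of pandas raise an error on text columns otherwise
flt_by_station=flt.groupby('Station').mean(numeric_only=True)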

In [None]:
flt_by_station

In [None]:
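# To make a contour plot with irregularly spaced data, you must first define your grid
# What should your grid look like for a section plot?
# The y-axis should be depth (here we use pressure), and the x-axis should be time
from scipy.interpolate import griddata
xi = pd.date_range(flt['date'].min(), flt['date'].max(), freq='10D')
yi = np.linspace(0, 2000, 201)
# Now we need to take the oxygen data and interpolate it onto that grid using its original x's and y's
# griddata does not want NaNs, so keep only complete rows (in a new dataframe, so flt stays intact)
oxy = flt.dropna(subset=['Oxygen[µmol/kg]','Pressure[dbar]','date'])
# griddata also needs plain numbers, so convert the dates to days since the first grid date
x_days = ((oxy['date'] - xi[0]) / pd.Timedelta(days=1)).to_numpy()
xi_days = ((xi - xi[0]) / pd.Timedelta(days=1)).to_numpy()
grid_x, grid_y = np.meshgrid(xi_days, yi)
# rescale=True puts days and dbar on comparable scales before triangulating
grid_z = griddata((x_days, oxy['Pressure[dbar]'].to_numpy()), oxy['Oxygen[µmol/kg]'].to_numpy(),
                  (grid_x, grid_y), method='linear', rescale=True)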
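
The gridding step can be checked on synthetic data where the right answer is known. In the sketch below, oxygen is a made-up linear function of pressure, so linear interpolation should reproduce it on the grid. None of the numbers come from the float.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

# Synthetic, irregularly spaced "profiles" (all values are made up)
rng = np.random.default_rng(1)
n = 500
dates = pd.Timestamp('2020-01-01') + pd.to_timedelta(rng.uniform(0, 100, n), unit='D')
pressure = rng.uniform(0, 2000, n)
oxygen = 350.0 - 0.08 * pressure

# griddata needs plain numbers, so express time as days since the first sample
days = np.asarray((dates - dates.min()) / pd.Timedelta(days=1))

# Regular grid well inside the sampled region: time along x, pressure along y
xi = np.linspace(20, 80, 7)
yi = np.linspace(400, 1600, 13)
grid_x, grid_y = np.meshgrid(xi, yi)

grid_z = griddata((days, pressure), oxygen, (grid_x, grid_y),
                  method='linear', rescale=True)
print(grid_z.shape)
```

The result has one row per pressure level and one column per time step, so it can go straight into `plt.contourf(grid_x, grid_y, grid_z)`. Grid points outside the region covered by the data come back as NaN.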

In [None]:
# NEXT: Try to use xarray to explore float data
# start with Ryan Abernathey's xarray lesson: 
# https://github.com/rabernat/research_computing/blob/master/content/lectures/python/xarray.ipynb
# Play with cookie-cutter project reproducibility tools from Julius Busecke
# Play with sea-py, mayavi
# http://www.pyngl.ucar.edu/
# https://docs.enthought.com/mayavi/mayavi/installation.html
# https://www.itsonlyamodel.us/argovis-python-api-2.html#section_two

# Machine Learning
# https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/2018JC014629

# importing CMIP6 data https://towardsdatascience.com/a-quick-introduction-to-cmip6-e017127a49d3