# Chapter 5 - Working with data: pandas and xarray

This chapter covers two libraries that are essential to ocean data analysis: __pandas__ and __xarray__. We cover the basics of both, with examples.

Although we show complete examples here, we invite you to edit and rerun them to better grasp their functionality.

***
<img src='./figures/pandas_logo.png'>

## pandas   
    
__pandas__ is a `Python` package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.

The two primary data structures of __pandas__ are the Series (1-dimensional) and the DataFrame (2-dimensional). __pandas__ is built on top of __NumPy__ and is intended to integrate well within a scientific computing environment with many other third-party libraries. For this reason, when importing __pandas__, we will also import __numpy__, so we can use the methods of both.
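
As a minimal sketch of these two structures (the values and column names below are made up for illustration and are not taken from the chapter's data), a Series holds one labeled column, a DataFrame holds several, and both can be converted to plain NumPy arrays:

```python
import numpy as np
import pandas as pd

# A Series: one labeled, 1-dimensional column (illustrative values)
temps = pd.Series([12.1, 12.4, 11.9], name="Temp")

# A DataFrame: a 2-dimensional table whose columns are Series
table = pd.DataFrame(
    {"Temp": [12.1, 12.4, 11.9], "Salinity": [35.0, 35.2, 34.9]}
)

print(temps.ndim, table.ndim)      # number of dimensions of each structure
print(table.shape)                 # (rows, columns)
print(type(table.to_numpy()))      # the underlying data as a NumPy array
```

Both objects carry an index (the row labels), which is what makes label-based selection possible later on.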

Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. __pandas__ is the ideal tool for all of these tasks.

__Test this:__ Run the next cell to import these libraries. We are importing them using their conventional nicknames, although feel free to choose your own. Note that when you run an importing cell, no output is displayed other than a number between [ ] on the left side of the cell.


In [1]:
#typical naming conventions for both packages
import numpy as np 
import pandas as pd

# this library helps to make your code execution less messy
import warnings
warnings.simplefilter('ignore') # filter some warning messages

### Reading in data

__pandas__ can be useful to read in both .csv and .txt files. 

The cell below shows the main function for both file types. Note that the data is stored as a 2D DataFrame and that the title of each column becomes the column name.

In [None]:
#read in data
buoy = pd.read_csv('data/BuoyData.csv') #same function also works for .txt files

#look at the type of data structure
print(type(buoy))

#view first five rows of data with head function
buoy.head()
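
The sketch below illustrates the remark about .txt files without needing a file on disk. It assumes a tab-delimited text layout and uses small made-up tables held in memory with `io.StringIO`; the column names are only illustrative. The `sep` argument of `pd.read_csv` tells pandas which delimiter to expect.

```python
import io
import pandas as pd

# Two in-memory "files" with the same (made-up) content
csv_text = "ID,Wind,Temp\nA1,5.2,12.1\nA2,6.8,12.4\n"        # comma separated
txt_text = "ID\tWind\tTemp\nA1\t5.2\t12.1\nA2\t6.8\t12.4\n"  # tab separated

from_csv = pd.read_csv(io.StringIO(csv_text))            # default separator is ","
from_txt = pd.read_csv(io.StringIO(txt_text), sep="\t")  # tell pandas about the tabs

print(from_csv.head())
print(from_csv.equals(from_txt))  # compare the two resulting tables
```

For a real text file, replace the `io.StringIO(...)` object with the file path and set `sep` to whatever delimiter the file actually uses.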

### Indexing and Slicing Data

Data can be called by the column names and stored as a new variable. Notice that when a single column is selected, the result is a 1D Series.

You can also select multiple columns by passing a list of names within the indexing brackets.



In [None]:
#look at a certain column 
print(buoy['ID'].head())
#same as: buoy.ID

#select out a certain column
wind = buoy['Wind']
print(wind.head())

#look at the type of data structure for wind
print(type(wind))

#select out two or more columns
wind_temp = buoy[['Wind','Temp']] #note the [[]] for selecting multiple columns
print(wind_temp.head())
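
To make the difference between single and double brackets concrete, here is a small sketch on a made-up table (the values are illustrative, not the buoy data). It shows which selections return a Series and which return a DataFrame:

```python
import pandas as pd

# Made-up table with the same kind of columns used above
demo = pd.DataFrame(
    {"ID": ["A1", "A2", "A3"], "Wind": [5.2, 6.8, 4.1], "Temp": [12.1, 12.4, 11.9]}
)

one_column = demo["Wind"]             # single brackets: a 1D Series
two_columns = demo[["Wind", "Temp"]]  # list of names: a 2D DataFrame
still_a_table = demo[["Wind"]]        # a one-item list also gives a DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(type(still_a_table).__name__)
```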

You can also filter data by rows using conditions and column names. In the cell below you select all of the rows where the salinity is greater than or equal to 39.

Renaming columns in a dataframe can be done with the rename function, to which you give a dictionary that maps each old column name to the new one.

Calculating new columns in __pandas__ is easy too. You index the dataframe with the new column name you want and then assign the result of the calculation to it.

In [None]:
#conditionally filter out rows
sal_39 = buoy[buoy['Salinity'] >= 39]
print(sal_39.head())

#rename a column
buoy.rename(columns = {'Temp':'Temp_C'}, inplace = True)
print("Buoy Column Names:", buoy.columns)

#create a new column
buoy['Temp_F'] = np.around((buoy['Temp_C'] * (9/5)) + 32,2) #the around function from numpy rounds the data
print(buoy.head())
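
The three steps above can be sketched on a made-up table (values are illustrative only). The sketch also shows two variations that are assumptions of this example rather than part of the cell above: combining two conditions with `&`, and using `rename` without `inplace=True` so that the original table is left unchanged.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"Salinity": [38.5, 39.0, 39.4, 37.9], "Temp": [14.0, 15.5, 16.2, 13.1]}
)

# A condition produces a boolean Series; indexing with it keeps the True rows
mask = demo["Salinity"] >= 39
salty = demo[mask]

# Conditions can be combined with & (and) or | (or); each needs parentheses
salty_warm = demo[(demo["Salinity"] >= 39) & (demo["Temp"] > 16)]

# Without inplace=True, rename returns a new DataFrame and leaves demo as is
renamed = demo.rename(columns={"Temp": "Temp_C"})

# New column computed from an existing one, rounded to two decimals
renamed["Temp_F"] = np.around(renamed["Temp_C"] * 9 / 5 + 32, 2)

print(salty)
print(salty_warm)
print(renamed)
```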

### Saving Data

Data can be exported using the to_csv function. The index argument determines whether the index (the unnamed column before the first data column, with one value per row) is exported with the data. By default this argument is set to `True`.

In [None]:
#exporting pandas data
buoy.to_csv('data/buoy_updated.csv', index = False)
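
To see what the index argument changes, the sketch below writes a made-up table to text instead of to a file. When no path is given, `to_csv` returns the CSV content as a string, which makes the two results easy to compare.

```python
import pandas as pd

demo = pd.DataFrame({"Wind": [5.2, 6.8], "Temp": [12.1, 12.4]})

with_index = demo.to_csv()                # index=True is the default
without_index = demo.to_csv(index=False)  # drop the index column

# Compare the header lines of the two results
print(with_index.splitlines()[0])     # starts with an empty field for the index
print(without_index.splitlines()[0])  # only the data columns
```

Leaving the index out is usually what you want when the index is just the default row counter, since it would otherwise come back as an extra unnamed column when the file is read again.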

***
<img src='./figures/xarray_logo.png'>

## xarray   
    
__xarray__ is an open source `Python` library designed to handle (read, write, analyze, visualize, etc.) sets of labeled multi-dimensional arrays and metadata common in _(Earth)_ sciences. Its data structure, the __Dataset__, is built to reflect a netcdf file. __xarray__ was built on top of the __pandas__ library, which processes labeled tabular data, inheriting several of its methods and functionality.

For this reason, when importing __xarray__, we will also import __numpy__ and __pandas__, so we can use all their methods. 

__Test this:__ Run the next cell to import these libraries. We are importing them using their conventional nicknames, although feel free to choose your own. Note that when you run an importing cell, no output is displayed other than a number between [ ] on the left side of the cell.


In [2]:
import xarray as xr

#these two packages were already loaded in above 
#import numpy as np 
#import pandas as pd


###  Reading and exploring Data Sets
    
__Run the next cell:__  Let's start by reading and exploring the content of a `netcdf` file located locally. __It is so easy!__

Once the content is displayed, you can click on the file and disk icons on the right to get more details on each parameter.

Also note that the __data array__ or __variable__ _(SST)_ has 3 __dimensions__ _(latitude, longitude and time)_ , and that each dimension has a data variable (__coordinate__) associated with it. Each variable as well as the file as a whole has metadata denominated __attributes__.

In [3]:
ds = xr.open_dataset('./data/HadISST_sst_2000-2020.nc') # read a local netcdf file
ds.close() # close the file so it can be used by you or others; it is good practice
ds  # display the content of the dataset object
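
If the data file is not at hand, the vocabulary above (variable, dimensions, coordinates, attributes) can be explored on a small synthetic Dataset. Everything in this sketch is made up for illustration: the values, the grid, and the attribute text. Only the names `sst`, `time`, `latitude` and `longitude` mirror the file described above.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic coordinates: 3 times, 3 latitudes, 4 longitudes
time = pd.date_range("2000-01-01", periods=3, freq="D")
lat = [10.0, 0.0, -10.0]
lon = [100.0, 110.0, 120.0, 130.0]
values = np.arange(3 * 3 * 4, dtype="float64").reshape(3, 3, 4)

demo_ds = xr.Dataset(
    # each variable: (dimension names, data, attributes of the variable)
    data_vars={"sst": (("time", "latitude", "longitude"), values, {"units": "degC"})},
    # one coordinate per dimension
    coords={"time": time, "latitude": lat, "longitude": lon},
    # attributes of the dataset as a whole
    attrs={"title": "synthetic example"},
)

print(demo_ds.sst.dims)            # dimensions of the variable
print(dict(demo_ds.sizes))         # length of each dimension
print(demo_ds.sst.attrs["units"])  # variable-level metadata
print(demo_ds.attrs["title"])      # file-level metadata
```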

Sometimes you might need to read in many files at once. The `open_mfdataset` function from __xarray__ lets you open only the files that match a pattern and combine them into a single dataset. The three files opened in the next cell are from a satellite, where each time stamp is stored in its own file.

In [4]:
ds_mf = xr.open_mfdataset('data/*Thetao*.nc', engine = 'netcdf4') #engine argument selects the backend used to read the files
ds_mf.close()
ds_mf

| | Array | Chunk |
|---|---|---|
| Bytes | 332.83 MiB | 110.94 MiB |
| Shape | (3, 31, 601, 1561) | (1, 31, 601, 1561) |
| Count | 9 Tasks | 3 Chunks |
| Type | float32 | numpy.ndarray |
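
The idea of combining several single-time pieces into one object can be sketched without any files. Note that this is not what the cell above runs: the sketch uses `xr.concat` on small in-memory Datasets as a stand-in, and the variable name `thetao`, the depths and the values are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

def one_time_step(day):
    """Build a tiny Dataset holding a single (made-up) time stamp."""
    return xr.Dataset(
        {"thetao": (("time", "depth"), np.full((1, 2), float(day)))},
        coords={"time": pd.to_datetime([f"2020-01-0{day}"]), "depth": [0.5, 10.0]},
    )

# Three pieces, each with a time dimension of length 1
pieces = [one_time_step(day) for day in (1, 2, 3)]

# Join them along the time dimension into a single Dataset
combined = xr.concat(pieces, dim="time")

print(combined.thetao.shape)  # time now has one entry per piece
print(combined.time.values)
```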


__xarray__ can also read data online. We will learn how to read data from the cloud in the application chapters, but for now we illustrate the ability of __xarray__ and `Python` to read from an online file. __Run the next cell__ to do so.

In [None]:
# assign a string variable with the url address of the datafile
url = 'https://podaac-opendap.jpl.nasa.gov/opendap/allData/ghrsst/data/GDS2/L4/GLOB/CMC/CMC0.2deg/v2/2011/305/20111101120000-CMC-L4_GHRSST-SSTfnd-CMC0.2deg-GLOB-v02.0-fv02.0.nc'
ds_sst = xr.open_dataset(url) # reads same way as local files!
ds_sst

###  Visualizing data
    
An image is worth a thousand _attributes_! Sometimes what we need is a quick visualization of our data, and __xarray__ is there to help. In __the next cells__, visualizations of both opened datasets are shown.

In [None]:
ds_sst.analysed_sst.plot() # note that we needed to choose one of the variables in the Dataset to be displayed


In [None]:
ds.sst[0,:,:].plot() # we choose a time to visualize the spatial data (lat, lon) at that time (zero or the first time entry)


#### Yes! it is that easy! 
Although we'll get more sophisticated in Chapter 4b.

### Some basic methods of Dataset
   
__xarray__ also lets you operate over the dataset in a simple way. Many operations are built as methods of the Dataset class that can be accessed by adding a `.` after the Dataset name. __Test this:__ In the next cell, we access the _averaging_ method to make a time series of sea surface temperature over the entire globe and display it. __All in one line!__

In [None]:
ds.sst.mean(dim=['latitude','longitude']).plot() # select a variable and average it
# over spatial dimensions, and plot the final result
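
What averaging over dimensions does to the shape of the data can be checked on a small synthetic array (the values below are made up, and plotting is left out so that the numbers can be inspected directly):

```python
import numpy as np
import xarray as xr

# Synthetic data: 2 times, 3 latitudes, 4 longitudes
sst = xr.DataArray(
    np.arange(24, dtype="float64").reshape(2, 3, 4),
    dims=("time", "latitude", "longitude"),
)

# Averaging over the two spatial dimensions leaves only the time dimension
series = sst.mean(dim=["latitude", "longitude"])

print(series.dims)    # remaining dimensions
print(series.values)  # one average per time step
```

The dimensions named in `dim` disappear from the result, which is why the plot in the cell above is a line along time rather than a map.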


### Selecting data

Sometimes we want to visualize or operate only on a portion of the data. __In the next cell__ we demonstrate the method `.sel`, which selects data along dimensions, in this case specified as a range of the coordinates using the function _slice_.

In [None]:
ds.sst.sel(time=slice('2012-01-01','2013-12-31')).mean(dim=['time']).plot() # select a period of time

In [None]:
ds.sst.sel(latitude=slice(50,-50)).mean(dim=['time']).plot() # select a range of latitudes. 
# note that we need to go from 50 to -50 as the latitude coordinate data goes from 90 to -90
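
The effect of the slice order can be seen on a small synthetic array whose latitude coordinate runs from 90 to -90, as in the file above. The values are made up, and the `method="nearest"` lookup at the end is an extra illustration that is not used in the cells above.

```python
import numpy as np
import xarray as xr

lat = np.array([90.0, 45.0, 0.0, -45.0, -90.0])  # descending, like the file above
da = xr.DataArray(np.arange(5.0), coords={"latitude": lat}, dims="latitude")

# The slice must follow the order of the coordinate: high to low here
band = da.sel(latitude=slice(50, -50))

# Written in the opposite order, the same range selects nothing
wrong_order = da.sel(latitude=slice(-50, 50))

# A single label can be matched to the closest coordinate value
closest = da.sel(latitude=40, method="nearest")

print(band.latitude.values)
print(wrong_order.size)
print(float(closest.latitude))
```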

Another useful way to select data is the method __.where__, which instead of selecting by a coordinate, keeps the values that satisfy a condition on the data or the coordinates and masks the rest. __Test this:__ In the next cell we extract the _ocean mask_ contained in the NASA surface temperature dataset.

In [None]:
ds_sst.analysed_sst.where(ds_sst.mask==1).plot() # using .where, we keep the values of 'analysed_sst' at the points where
# the variable 'mask' is equal to 1, and plot the result.
# Try changing the value for mask - for example 2 is land, 8 is ice.
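
A small sketch of what `.where` returns, using a made-up 2x2 field and a made-up mask (the mask codes simply mimic the ones mentioned above). The result keeps the original shape, and the points that fail the condition become missing values:

```python
import xarray as xr

# Made-up temperature field and mask on the same 2x2 grid
sst = xr.DataArray([[280.0, 281.0], [282.0, 283.0]], dims=("lat", "lon"))
mask = xr.DataArray([[1, 2], [1, 8]], dims=("lat", "lon"))

# Keep values where the mask equals 1; everything else becomes NaN
ocean_only = sst.where(mask == 1)

print(ocean_only.values)
print(int(ocean_only.isnull().sum()))  # number of masked points
print(float(ocean_only.mean()))        # NaN values are skipped in the mean
```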

### Operating between two Data Arrays
    
__In the next__ example we compare two years of temperature. We operate over the same Data Array, but we average over 2015 in the first line and over 2012 in the second line. Each `.sel` operation returns a new Data Array. We can subtract them by using a simple `-`, since they have the same dimensions and coordinates. At the end, we just plot the result. __It is that simple!__

In [None]:
# comparing 2015 and 2012 sea surface temperatures
(ds.sst.sel(time=slice('2015-01-01','2015-12-31')).mean(dim=['time'])
-ds.sst.sel(time=slice('2012-01-01','2012-12-31')).mean(dim=['time'])).plot() # note that wrapping the expression in parentheses lets us split it across two lines,
# which makes it easier to read
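
Why matching coordinates matter can be seen in a short sketch with made-up 1D arrays. When two Data Arrays share the same coordinate values, the subtraction is element by element. When they only partly overlap, xarray aligns them by label and, with its default settings, keeps only the shared labels.

```python
import xarray as xr

a = xr.DataArray([10.0, 20.0, 30.0], coords={"lat": [0, 1, 2]}, dims="lat")
b = xr.DataArray([1.0, 2.0, 3.0], coords={"lat": [0, 1, 2]}, dims="lat")

# Same coordinates: a plain element-wise difference
same_grid = a - b

# Partly overlapping coordinates: values are matched by label, not by position
shifted = xr.DataArray([1.0, 2.0, 3.0], coords={"lat": [1, 2, 3]}, dims="lat")
overlap = a - shifted

print(same_grid.values)
print(overlap.lat.values)  # only the labels present in both arrays
print(overlap.values)
```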

We will cover more examples of methods and operations over datasets in the following chapters. But if you want to learn more, and we recommend it, given the many awesome capabilities of xarray, please look at the __Resources__ section below. 

***

### Saving your Datasets and DataArrays
There is one more thing you should learn here. In the application chapters we go from obtaining the data to analyzing it and producing a visualization. But sometimes we want to save the data we acquire so we can process it later, either in a different script or in the same one, without having to download it every time.

__The next cell__ shows you how to do so in two simple steps:

- Assign the outcome of an operation to a variable, which will be a new dataset or data array object
- Save it to a new `netcdf` file

In [None]:
# same operation as before, minus the plotting method
my_ds = (ds.sst.sel(time=slice('2015-01-01','2015-12-31')).mean(dim=['time'])-ds.sst.sel(time=slice('2012-01-01','2012-12-31')).mean(dim=['time']))
# save the new data array `my_ds` to a file in the directory data
my_ds.to_netcdf('./data/Global_SST_2015-2012.nc')
# explore the content of `my_ds`. note that the time dimension does not exist anymore
my_ds

*** 

## Resources

[The __pandas__ official site](https://pandas.pydata.org/docs/index.html).

[The __xarray__ official site](http://xarray.pydata.org/en/stable/).

Useful [documentation](https://docs.xarray.dev/en/stable/user-guide/plotting.html) for basic __xarray__ plotting.

Great [introduction](https://www.youtube.com/watch?v=Dgr_d8iEWk4&t=908s) to __xarray__ capabilities.

If you really want to dig deep watch this [video](https://www.youtube.com/watch?v=ww4EYv20Ucw).

A step-by-step [guide](https://rabernat.github.io/research_computing_2018/xarray.html) to __xarray__ handling of netcdf files, and many of the methods seen here, like `.sel` and `.where`.

### More on:

Sometimes, the best way to learn how to do something is go directly to the reference page for a function or method. There you can see what arguments, types of data, and outputs to expect. Most of the time, they have useful examples:

- Method [__.groupby( )__](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html)

- Method [__.where( )__](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.where.html)

- Method [__.sel( )__](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.sel.html)

- Method [__.mean( )__](http://xarray.pydata.org/en/stable/generated/xarray.DataArray.mean.html)
