# INTRO 
## Objectives: 
* Intro to Python libraries
* Downloading NetCDF data via ERDDAP 
* Read in / format data
* Navigating/slicing data format

## Python Libraries 
Python uses libraries: collections of related code and functions for various tasks. <br>
Some libraries come pre-installed (see list [here](https://www.geeksforgeeks.org/libraries-in-python/)). <br>
Other libraries need to be installed with pip before we can use them. Installation is straightforward and can be done from the command line using *pip install {package name}* OR in Jupyter Notebooks using *!pip install {package name}*. <br>

Once a package is installed, we still need to import it before we can use it. Every time we start a new notebook or restart the kernel, we need to import all the packages we plan on using.

In [1]:
#to import a package: 
import numpy as np
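
Before running *pip install*, it can be handy to check whether a library is already available. The sketch below is one possible way to do that with the standard library's `importlib`. The helper name `is_installed` is made up for this example and is not taken from the startup file.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find the package, False otherwise."""
    return importlib.util.find_spec(package_name) is not None

#check a couple of libraries that ship with Python
for name in ["os", "json"]:
    print(name, "installed:", is_installed(name))
```

Note that the name used for importing can differ from the name used with pip. For example, `sklearn` is installed with *pip install scikit-learn*.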

If we're using a lot of libraries, importing them can take up a lot of space. To avoid this, we wrote a "startup file". This is a separate Jupyter notebook that imports any library we might use, and we run that notebook from here (see below). The startup file also sets our directories for us.

In [5]:
#check if we've run the startup file already 
import os
try:
    os.chdir(base_dir) #if the start up file has already been run, this will change the dir back to base
except NameError:
    base_dir = None #if the start up file hasn't been run yet, this will allow us to run this cell without crashing
    
#run startup file (imports libraries and installs them if necessary)
%run ./startup_file.ipynb

sklearn is already installed :D
xarray is already installed :D
scipy is already installed :D
joblib is already installed :D
C:\Users\haley.synan


We are currently in our project directory. We want to change to the data directory so we can download files to a project-specific location.

In [6]:
os.chdir(data_dir) #cd to data directory 
#create new project data folder 
proj_data = '/ECOMON' #name of project
data_dir_fold = data_dir+proj_data #path for new folder 
isexist = os.path.exists(data_dir_fold) #check if path exists 
#print(isexist)
if not isexist: #if path doesn't exist already, make it
    os.mkdir(data_dir_fold)


os.chdir(data_dir_fold) #go to project data folder
os.getcwd() #check if in the correct folder

'C:\\users\\haley.synan\\Documents\\SEASCAPES\\DATA\\ECOMON'
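
As an alternative to checking for the folder first, `os.makedirs` with `exist_ok=True` creates the folder only when it is missing. The sketch below uses a temporary folder as a stand-in for `data_dir` so it can be run anywhere. The variable names `base` and `project_folder` are assumptions for this example.

```python
import os
import tempfile

base = tempfile.mkdtemp() #temporary stand-in for data_dir
project_folder = os.path.join(base, "ECOMON") #join path pieces without typing slashes

os.makedirs(project_folder, exist_ok=True) #creates the folder
os.makedirs(project_folder, exist_ok=True) #folder already exists, so nothing happens (no error)

print(os.path.isdir(project_folder)) #check that the folder is there
```

Using `os.path.join` also avoids mixing forward and backward slashes in the path.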

Now we are ready to load and work with data 

## Downloading data 

For this example, we are going to work with some [EcoMon](https://www.fisheries.noaa.gov/feature-story/monitoring-northeast-shelf-ecosystem) data. EcoMon research cruises collect plankton and hydrographic data from the Northwest Atlantic continental shelf (typically four times a year).

We are going to use [ERDDAP](https://www.ncei.noaa.gov/products/weather-climate-models/using-erddap#:~:text=The%20Environmental%20Research%20Division%20Data,sources%20into%20a%20single%20workspace.) to download our data.

First, we want to navigate to the EcoMon ERDDAP page, located [HERE](https://comet.nefsc.noaa.gov/erddap/tabledap/ocdbs_v_erddap1.html)
* Fill out UTC_DATETIME, latitude, longitude. <br>
For this example we want UTC_DATETIME: 2022-01-01 to 2022-12-31, Latitude: 34.40918 to 46.362305, Longitude: -77.681645 to -63.585942 <br>
* Choose ".nc" for file type and click "just generate URL" <br>
This is the URL we will be calling below.

We will split the URL string to isolate the date terms (in this case 2022-01-01 and 2022-12-31) so we can name the file automatically.

In [7]:
url = ''.join(["https://comet.nefsc.noaa.gov/erddap/tabledap/ocdbs_v_erddap1.nc?UTC_DATETIME%2Clatitude%2Clongitude%2Cdepth%2Cpressure_dbars%2Csea_water_temperature%2Csea_water_salinity%2Cdissolved_oxygen%2Cfluorescence%2Cpar_sensor%2Ccast_number%2Ccruise_id%2Cpurpose_code%2Cbottom_depth%2CGEAR_TYPE&UTC_DATETIME%3E=2022-01-01&UTC_DATETIME%3C=2022-12-31&latitude%3E=34.40918&latitude%3C=46.362305&longitude%3E=-77.681645&longitude%3C=-63.585942"]) 

import urllib.request #needed for urlretrieve (harmless if the startup file already imported it)

def url2date(url, nu): #grab a date from the data query so we can use it to name our data file
    dat = url.split('%') #split the url at every '%'
    s_dat = dat[nu] #piece that holds the date, e.g. '3E=2022-01-01&UTC_DATETIME'
    s_dat = s_dat.split('=') #the date comes after the '='
    s_dat = s_dat[1].split('&') #and before the '&'
    s_dat = s_dat[0]
    return s_dat

s_date = url2date(url,nu=15) #start date is in piece 15 of the split url
e_date = url2date(url,nu=16) #end date is in piece 16 of the split url
fname = "/EcoMon_" + s_date + '_'+ e_date + ".nc" #create unique filename 
#follow kims naming structure 
#DD8_yyyymmdd_yyyymmdd start end dates #write function to get that parts of URL 
file = data_dir_fold+fname
urllib.request.urlretrieve(url, file) #download data

('C:/users/haley.synan/Documents/SEASCAPES/DATA/ECOMON/EcoMon_2022-01-01_2022-12-31.nc',
 <http.client.HTTPMessage at 0x23258aa8550>)

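Splitting on '%' depends on how many variables are in the query, so the piece numbers change if the variable list changes. A possible alternative, sketched below, decodes the URL and searches for the UTC_DATETIME constraints by name. The function name `query_dates` and the shortened `demo_url` are assumptions for this example (the shortened URL keeps the same date constraints but fewer variables).

```python
import re
from urllib.parse import unquote

#shortened version of the query url (fewer variables, same date constraints)
demo_url = ("https://comet.nefsc.noaa.gov/erddap/tabledap/ocdbs_v_erddap1.nc"
            "?UTC_DATETIME%2Clatitude%2Clongitude"
            "&UTC_DATETIME%3E=2022-01-01&UTC_DATETIME%3C=2022-12-31")

def query_dates(url):
    """Return (start, end) dates from the UTC_DATETIME constraints of the url."""
    decoded = unquote(url) #turns %3E into > and %3C into <
    start = re.search(r"UTC_DATETIME>=(\d{4}-\d{2}-\d{2})", decoded).group(1)
    end = re.search(r"UTC_DATETIME<=(\d{4}-\d{2}-\d{2})", decoded).group(1)
    return start, end

start, end = query_dates(demo_url)
fname_demo = "EcoMon_" + start + "_" + end + ".nc"
print(fname_demo)
```

This sketch assumes the dates are written as yyyy-mm-dd, as they are in the URL used here.
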
## NetCDFs

NetCDFs (.nc) are a common file type used to store and work with multi-dimensional data. They are made up of dimensions, variables, and attributes. <br>
More info on NetCDFs can be found [HERE](https://adyork.github.io/python-oceanography-lesson/17-Intro-NetCDF/index.html).
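
To make dimensions, variables, and attributes more concrete, here is a small sketch that builds a toy dataset in memory with xarray (it does not read the EcoMon file). The variable names mimic EcoMon columns, but the values and the `title` attribute are made up for illustration.

```python
import numpy as np
import xarray as xr

#toy dataset: two variables that share one dimension called "row"
ds_demo = xr.Dataset(
    data_vars={
        "latitude": ("row", np.array([41.065, 41.065, 40.675])),
        "depth": ("row", np.array([1.0, 2.0, 57.5])),
    },
    attrs={"title": "toy example"}, #made-up attribute
)

print(dict(ds_demo.sizes))     #dimensions and their lengths
print(list(ds_demo.data_vars)) #variable names
print(ds_demo.attrs)           #attributes (metadata)
```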



## Reading in data 

In Python, there are two main libraries used to read in and work with data: **Pandas** and **NumPy** <br>
Pandas is built ON TOP of NumPy <br>
Pandas reads data in as **dataframes**, which are similar to spreadsheets <br>
NumPy reads data in as **arrays** <br>
You can read more about these two libraries and their main functions [HERE](https://www.codecademy.com/article/introduction-to-numpy-and-pandas)

Because we are reading in a .nc file, we are going to first read in the data using xarray (a library for reading and working with multidimensional data) and then convert it to a pandas dataframe. The dataframe will look similar to a spreadsheet. 
If we were reading in a .csv file, we could read it in directly using [pandas.read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). <br>
We are also going to convert the dataframe to a numpy array. <br>
So now we have the data in three formats: 
* xarray dataset (hence variable name ds)
* pandas dataframe (variable name df)
* numpy array (variable name arr) 
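
For comparison, the sketch below shows the .csv route with `pandas.read_csv()`. To keep it self-contained, the "file" is a short piece of text wrapped in `io.StringIO`. The three rows are made-up values in the style of the EcoMon data.

```python
import io
import pandas as pd

#a tiny csv "file" held in memory (made-up values)
csv_text = """latitude,longitude,depth
41.065,-70.6317,1.0
41.065,-70.6317,2.0
40.675,-69.1883,57.5
"""

df_demo = pd.read_csv(io.StringIO(csv_text)) #read csv into a dataframe
arr_demo = df_demo.to_numpy()                #convert dataframe to numpy array

print(df_demo)
print(arr_demo.shape) #(rows, columns)
```

With a real file, the path to the .csv would be passed to `pd.read_csv()` instead of the `io.StringIO` object.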

In [8]:
ds = xr.open_dataset(file, decode_cf=True) #read in data using xarray 
ds #print variable

In [9]:
df = ds.to_dataframe()
df

row,UTC_DATETIME,latitude,longitude,depth,pressure_dbars,sea_water_temperature,sea_water_salinity,dissolved_oxygen,fluorescence,par_sensor,cast_number,cruise_id,purpose_code,bottom_depth,GEAR_TYPE
0,2022-03-05 16:49:00,41.064999,-70.631699,1.0,1.0,4.43,32.771999,,,,1.0,A12201,93.0,45.0,SBE-19+V2
1,2022-03-05 16:49:00,41.064999,-70.631699,2.0,2.0,4.38,32.771000,,,,1.0,A12201,93.0,45.0,SBE-19+V2
2,2022-03-05 16:49:00,41.064999,-70.631699,3.0,3.0,4.32,32.771999,,,,1.0,A12201,93.0,45.0,SBE-19+V2
3,2022-03-05 16:49:00,41.064999,-70.631699,4.0,4.0,4.29,32.771000,,,,1.0,A12201,93.0,45.0,SBE-19+V2
4,2022-03-05 16:49:00,41.064999,-70.631699,5.0,5.0,4.27,32.766998,,,,1.0,A12201,93.0,45.0,SBE-19+V2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107820,2022-11-15 21:37:00,40.674999,-69.188301,57.5,58.0,16.42,34.028999,,,0.0,207.0,HB2207,10.0,66.0,SBE-911+
107821,2022-11-15 21:37:00,40.674999,-69.188301,58.5,59.0,16.42,34.030998,,,0.0,207.0,HB2207,10.0,66.0,SBE-911+
107822,2022-11-15 21:37:00,40.674999,-69.188301,59.5,60.0,16.42,34.032001,,,0.0,207.0,HB2207,10.0,66.0,SBE-911+
107823,2022-11-15 21:37:00,40.674999,-69.188301,60.5,61.0,16.42,34.033001,,,0.0,207.0,HB2207,10.0,66.0,SBE-911+


In [10]:
arr = df.to_numpy()
arr

array([[Timestamp('2022-03-05 16:49:00'), 41.064998626708984,
        -70.63169860839844, ..., 93.0, 45.0, 'SBE-19+V2'],
       [Timestamp('2022-03-05 16:49:00'), 41.064998626708984,
        -70.63169860839844, ..., 93.0, 45.0, 'SBE-19+V2'],
       [Timestamp('2022-03-05 16:49:00'), 41.064998626708984,
        -70.63169860839844, ..., 93.0, 45.0, 'SBE-19+V2'],
       ...,
       [Timestamp('2022-11-15 21:37:00'), 40.67499923706055,
        -69.18830108642578, ..., 10.0, 66.0, 'SBE-911+'],
       [Timestamp('2022-11-15 21:37:00'), 40.67499923706055,
        -69.18830108642578, ..., 10.0, 66.0, 'SBE-911+'],
       [Timestamp('2022-11-15 21:37:00'), 40.67499923706055,
        -69.18830108642578, ..., 10.0, 66.0, 'SBE-911+']], dtype=object)

## [Accessing and slicing data](https://datacarpentry.org/python-ecology-lesson/03-index-slice-subset.html)
* Python uses **0-based indexing** - meaning the first element is at position 0 (not 1). <br>
* Depending on data type, you can access data by position (an integer) or by name. <br>
* **Slicing** is used to select a subset of rows or columns. It is done using [] and position locations. (See examples below) <br>
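
The points above can be tried on a plain Python list before moving on to the larger data structures. This is a minimal sketch with made-up depth values.

```python
depths = [1.0, 2.0, 3.0, 4.0, 5.0] #made-up values

first = depths[0]       #0-based indexing: position 0 is the first element
last = depths[-1]       #negative positions count from the end
first_two = depths[0:2] #slice: starts at position 0, stops BEFORE position 2
from_third = depths[2:] #leaving out the end means "to the end"

print(first, last)
print(first_two)
print(from_third)
```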






### Xarray dataset
Xarray datasets are used to store multidimensional (n-D) arrays. Unlike numpy arrays, they are LABELED arrays. 
Because we are reading in hydrographic data (rather than satellite data), notice that the coordinates are empty and the only dimension is row (see the ds variable). The coordinate data (latitude, longitude, depth) are stored in the data variables section of the dataset. Because of this, we cannot use the .sel() function to slice by coordinate and will have to slice manually.

In [11]:
ds #print variable

In [12]:
#to view a specific variable (also contains attribute data) 
ds.latitude #index by NAME (print latitude data variable) 

In [13]:
#to view ONLY the values in that variable (the VALUES of latitude are stored as np array) 
ds.latitude.values #index by NAME as ARRAY 

array([41.065, 41.065, 41.065, ..., 40.675, 40.675, 40.675], dtype=float32)

In [14]:
ds.latitude.values[0:4] #first 4 values of latitude 

array([41.065, 41.065, 41.065, 41.065], dtype=float32)

In [110]:
np.where(ds.depth <10) #find POSITION of depths less than 10
#np.where is a function for ARRAYS... since ds.depth is stored as an array, we can use this function

(array([     0,      1,      2, ..., 107770, 107771, 107772], dtype=int64),)
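
The positions returned by np.where can then be used to subset the whole dataset by position with `.isel()`. The sketch below shows this on a toy dataset with a single `row` dimension. The dataset `ds_demo` and its values are made up for this example.

```python
import numpy as np
import xarray as xr

#toy dataset with one dimension called "row" (made-up values)
ds_demo = xr.Dataset({
    "depth": ("row", np.array([1.0, 5.0, 12.0, 8.0, 30.0])),
    "sea_water_temperature": ("row", np.array([4.43, 4.27, 4.10, 4.20, 3.90])),
})

#np.where returns a tuple, so take element [0] to get the array of positions
shallow_pos = np.where(ds_demo.depth.values < 10)[0]

#keep only the rows at those positions (all variables are subset together)
shallow = ds_demo.isel(row=shallow_pos)

print(shallow_pos)
print(shallow.sea_water_temperature.values)
```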

### Pandas Dataframe
Dataframes are specific for 2-d tabular data (think spreadsheet). Dataframes typically use more storage than a np array. <br>

There are several ways to navigate and index (slice) through the data. Using a dataframe, you can dot-index by name using *df.variablename* <br>
You can also use the function *iloc* to index based on position, using [row,column] notation <br>

Using a colon, "**:**", means all, so [:,1] means all rows of the column at position 1 (the 2nd column, because of 0-based indexing). A range like 0:4 includes position 0 but stops before position 4.
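
Here is a minimal sketch of the [row,column] notation on a toy dataframe, which makes it easier to see which rows and columns come back. The dataframe `df_demo` and its values are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "latitude": [41.065, 41.065, 40.675],
    "longitude": [-70.6317, -70.59, -69.1883],
    "depth": [1.0, 2.0, 57.5],
})

first_col = df_demo.iloc[:, 0]  #all rows, column at position 0 (latitude)
second_col = df_demo.iloc[:, 1] #all rows, column at position 1 (longitude)
block = df_demo.iloc[0:2, 1:3]  #rows 0-1 and columns 1-2 (end positions are not included)

print(first_col.name, second_col.name)
print(block)
```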

In [15]:
df.latitude #index by NAME 

row
0         41.064999
1         41.064999
2         41.064999
3         41.064999
4         41.064999
            ...    
107820    40.674999
107821    40.674999
107822    40.674999
107823    40.674999
107824    40.674999
Name: latitude, Length: 107825, dtype: float32

In [16]:
df.latitude.values #index by NAME as ARRAY 

array([41.065, 41.065, 41.065, ..., 40.675, 40.675, 40.675], dtype=float32)

In [76]:
df.iloc[:,1:2] #index by POSITION...show the 2nd column (latitude) and all rows; the end of the range (2) is not included

row,latitude
0,41.064999
1,41.064999
2,41.064999
3,41.064999
4,41.064999
...,...
107820,40.674999
107821,40.674999
107822,40.674999
107823,40.674999


In [17]:
df.iloc[0:4,:] #show first 4 rows, all columns

row,UTC_DATETIME,latitude,longitude,depth,pressure_dbars,sea_water_temperature,sea_water_salinity,dissolved_oxygen,fluorescence,par_sensor,cast_number,cruise_id,purpose_code,bottom_depth,GEAR_TYPE
0,2022-03-05 16:49:00,41.064999,-70.631699,1.0,1.0,4.43,32.771999,,,,1.0,A12201,93.0,45.0,SBE-19+V2
1,2022-03-05 16:49:00,41.064999,-70.631699,2.0,2.0,4.38,32.771,,,,1.0,A12201,93.0,45.0,SBE-19+V2
2,2022-03-05 16:49:00,41.064999,-70.631699,3.0,3.0,4.32,32.771999,,,,1.0,A12201,93.0,45.0,SBE-19+V2
3,2022-03-05 16:49:00,41.064999,-70.631699,4.0,4.0,4.29,32.771,,,,1.0,A12201,93.0,45.0,SBE-19+V2


In [79]:
df.latitude[0:4] #first 4 rows of latitude 

row
0    41.064999
1    41.064999
2    41.064999
3    41.064999
Name: latitude, dtype: float32

In [98]:
df[df.depth == 1] #find all rows with depth = 1
#NOTE: use "=" for defining a variable, use "==" for checking equality between two objects 

row,UTC_DATETIME,latitude,longitude,depth,pressure_dbars,sea_water_temperature,sea_water_salinity,dissolved_oxygen,fluorescence,par_sensor,cast_number,cruise_id,purpose_code,bottom_depth,GEAR_TYPE
0,2022-03-05 16:49:00,41.064999,-70.631699,1.0,1.0,4.430000,32.771999,,,,1.0,A12201,93.0,45.0,SBE-19+V2
44,2022-03-05 18:54:00,41.064999,-70.589996,1.0,1.0,4.540000,32.751999,,,,2.0,A12201,93.0,44.0,SBE-19+V2
87,2022-03-05 20:43:00,41.064999,-70.533302,1.0,1.0,4.490000,32.775002,,,,3.0,A12201,93.0,45.0,SBE-19+V2
131,2022-03-05 21:53:00,41.134998,-70.401703,1.0,1.0,4.380000,32.742001,,,,4.0,A12201,93.0,39.0,SBE-19+V2
169,2022-03-11 15:02:00,40.994999,-70.440002,1.0,1.0,4.950000,32.791000,,,,5.0,A12201,93.0,43.0,SBE-19+V2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107264,2022-11-14 10:23:00,41.981701,-70.184998,1.0,1.0,13.640000,32.048000,,,39.0,200.0,HB2207,10.0,34.0,SBE-911+
107365,2022-11-15 14:00:00,40.891701,-68.956703,1.0,1.0,16.709999,34.069000,,,,114.0,HB2207,10.0,71.0,SBE-19+V2
107430,2022-11-15 16:58:00,40.688301,-68.956703,1.0,1.0,16.740000,33.896000,,,,115.0,HB2207,10.0,68.0,SBE-19+V2
107494,2022-11-15 20:10:00,40.766701,-69.153297,1.0,1.0,16.090000,33.867001,,,,116.0,HB2207,10.0,72.0,SBE-19+V2
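
Conditions can also be combined. In pandas, each condition goes in its own parentheses and they are joined with `&` (and) or `|` (or). The sketch below uses a toy dataframe with made-up values, and `df_demo` is a name chosen for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "depth": [1.0, 2.0, 1.0, 57.5],
    "cruise_id": ["A12201", "A12201", "HB2207", "HB2207"],
    "sea_water_temperature": [4.43, 4.38, 13.64, 16.42],
})

surface = df_demo[df_demo.depth == 1] #one condition: all rows with depth = 1

#two conditions: depth = 1 AND cruise HB2207
surface_hb = df_demo[(df_demo.depth == 1) & (df_demo.cruise_id == "HB2207")]

print(surface)
print(surface_hb)
```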


### NP Array 
Arrays can be n-D data structures. They are pretty general purpose, and may be faster than a dataframe (if working with really large data). 
Because this array doesn't have column headers, we have to index based on position ONLY. <br> Ex) instead of calling latitude, we have to call the second column (position 1). 

In [89]:
lat = arr[:,1] #second COLUMN (latitude) 
lat

array([41.064998626708984, 41.064998626708984, 41.064998626708984, ...,
       40.67499923706055, 40.67499923706055, 40.67499923706055],
      dtype=object)

In [93]:
arr[0] #first row 

array([Timestamp('2022-03-05 16:49:00'), 41.064998626708984,
       -70.63169860839844, 1.0, 1.0, 4.429999828338623, 32.77199935913086,
       nan, nan, nan, 1.0, 'A12201', 93.0, 45.0, 'SBE-19+V2'],
      dtype=object)
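
Notice the `dtype=object` in the outputs above: the array mixes timestamps, numbers, and text, so numpy stores every value as a generic Python object. One way around this, sketched below on a toy dataframe, is to convert only the numeric columns. The dataframe `df_demo` and its values are made up for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "latitude": [41.065, 40.675],
    "depth": [1.0, 57.5],
    "cruise_id": ["A12201", "HB2207"],
})

mixed = df_demo.to_numpy()                          #numbers and text together
numeric = df_demo[["latitude", "depth"]].to_numpy() #numeric columns only

print(mixed.dtype)          #object
print(numeric.dtype)        #float64
print(numeric[:, 1].mean()) #mean of the depth column (position 1)
```

A numeric array is usually the better choice for calculations, since object arrays lose most of numpy's speed advantage.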

## OTHER FUNCTIONS 

* len() - find length of variable
* type() - find type of variable


In [18]:
len(df)

107825

In [112]:
len(ds.latitude)

107825

In [116]:
type(ds.latitude) #stored as an xarray DataArray

xarray.core.dataarray.DataArray

In [19]:
type(ds.latitude.values) #note the VALUES of the ds.latitude are stored as np array 

numpy.ndarray

In [119]:
type(arr)

numpy.ndarray