# The Scientific Python Ecosystem

The Scientific Python ecosystem is a collection of packages that provide functionality for everything from simple numeric arrays to sophisticated machine learning algorithms. In this notebook, we'll introduce the core Scientific Python packages and some important terminology.

![](./images/stack.png)

### Outline
- Python
- NumPy
- SciPy
- Pandas
- Xarray

### Tutorial Duration
10 minutes

### Going Further

This notebook is just meant to make sure we all have the same base terminology before jumping into our tutorial. If you are new to Python or just want to brush up, you may be interested in the following online resources:

- Scientific Python Lectures: http://scipy-lectures.org/
- Numpy Tutorial: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
- Scipy Tutorial: https://docs.scipy.org/doc/scipy/reference/tutorial/index.html
- Pandas Tutorials: https://pandas.pydata.org/pandas-docs/stable/tutorials.html
- Xarray Tutorial: https://geohackweek.github.io/nDarrays/

## Python built-ins

In [None]:
# data types
x = 4  # integer
type(x)

In [None]:
pi = 3.14  # float
type(pi)

In [None]:
name = 'my string'  # a string type
type(name)

In [None]:
# data structures / objects

my_list = [2, 4, 10]  # a list

my_list[2]  # access by position

In [None]:
my_dict = {'pi': 3.14, 'd': 4}  # a dictionary


my_dict['pi']  # access by key

## NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

Numpy Documentation: https://docs.scipy.org/doc/numpy/

In [None]:
import numpy as np

In [None]:
# create a 4x5 array of zeros
x = np.zeros(shape=(4, 5))
x

In [None]:
# add an integer to the array of zeros
y = x + 4
y

In [None]:
# create an array of random numbers
z = np.random.random(x.shape)
z

In [None]:
# aggregate along an axis of the array
z_sum = z.sum(axis=1)
z_sum

In [None]:
# broadcasting: y.transpose() has shape (5, 4) and z_sum has shape (4,),
# so numpy aligns the trailing axes and repeats z_sum across the 5 rows
y.transpose() * z_sum
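
The shape rules behind that broadcast can be sketched with small, deterministic arrays. This is a minimal illustration rather than part of the original tutorial data: `z_sum` here is a stand-in built with `np.arange`, and `by_row` is a name introduced only for this example.

```python
import numpy as np

y = np.full((4, 5), 4.0)   # same shape and values as y above
z_sum = np.arange(4.0)     # stand-in for z.sum(axis=1), shape (4,)

# y.T has shape (5, 4); z_sum has shape (4,).
# Shapes are compared starting from the trailing axis: 4 matches 4, and the
# missing leading axis of z_sum is treated as length 1 and stretched to 5.
result = y.T * z_sum
print(result.shape)        # (5, 4)

# Without the transpose, the trailing axes are 5 and 4, which do not match.
try:
    y * z_sum
except ValueError as err:
    print("broadcast failed:", err)

# Adding an explicit length-1 axis aligns z_sum with the rows of y instead.
by_row = y * z_sum[:, np.newaxis]
print(by_row.shape)        # (4, 5)
```

Either the transpose or the explicit `np.newaxis` works; which one reads better depends on the shape you want for the result.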

In [None]:
# slicing (or selecting a subset of an array)
z[2:4, ::2]  # rows 2 and 3 on the first axis (the stop index 4 is excluded), every 2nd column on the second
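
Because `z` holds random numbers, it can be hard to see which elements a slice picks out. The sketch below repeats the same slice on a small array with known values; the names `a` and `sub` are used only for this illustration.

```python
import numpy as np

a = np.arange(20).reshape(4, 5)   # rows 0-3, columns 0-4
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]]

sub = a[2:4, ::2]   # rows 2 and 3, columns 0, 2 and 4
print(sub)
# [[10 12 14]
#  [15 17 19]]

# Basic slices return views, so writing to `sub` also changes `a`.
sub[0, 0] = -1
print(a[2, 0])      # -1
```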

In [None]:
# data types

xi = np.array([1, 2, 3], dtype=int)  # integer (the np.int alias was removed in NumPy 1.24)
xi.dtype

In [None]:
xf = np.array([1, 2, 3], dtype=float)  # float (the np.float alias was removed in NumPy 1.24)
xf.dtype

In [None]:
# universal functions (ufuncs, e.g. sin, cos, exp, etc)
np.sin(z_sum)


## SciPy

SciPy is a collection of mathematical algorithms and convenience functions built on the Numpy extension of Python. It adds significant power to the interactive Python session by providing the user with high-level commands and classes for manipulating and visualizing data. SciPy includes a number of subpackages covering different scientific computing domains:

| Subpackage | Description|
| ------| ------|
| cluster |	Clustering algorithms|
| constants |	Physical and mathematical constants|
| fftpack |	Fast Fourier Transform routines|
| integrate |	Integration and ordinary differential equation solvers|
| interpolate |	Interpolation and smoothing splines|
| io |	Input and Output|
| linalg |	Linear algebra|
| ndimage |	N-dimensional image processing|
| odr |	Orthogonal distance regression|
| optimize |	Optimization and root-finding routines|
| signal |	Signal processing|
| sparse |	Sparse matrices and associated routines|
| spatial |	Spatial data structures and algorithms|
| special |	Special functions|
| stats |	Statistical distributions and functions|

Because SciPy is built directly on NumPy, we'll skip any examples for now. The SciPy API is well documented, with examples of how to use specific subpackages.

SciPy Documentation: 

## Pandas

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.

Pandas Documentation: http://pandas.pydata.org/pandas-docs/stable/
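
To make "labeled" data concrete before loading a real file, here is a minimal sketch using synthetic values. The column name `value` and the date range are made up for illustration and are not part of the co2 dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per day for two years (not the co2 dataset).
index = pd.date_range("2000-01-01", "2001-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(index), dtype=float)}, index=index)

# Rows are selected by label (here, dates) rather than by position.
january = df.loc["2000-01-01":"2000-01-31"]
print(len(january))        # 31 (label slices include both endpoints)

# Group rows by a key derived from the labels: the calendar month.
monthly_mean = df.groupby(df.index.month).mean()
print(monthly_mean.shape)  # (12, 1)
```

Unlike positional slicing in NumPy, a label-based slice includes its end label, which is why all 31 days of January are returned.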

In [None]:
import pandas as pd

In [None]:
# This data can also be loaded from the statsmodels package
# import statsmodels.api as sm
# co2 = sm.datasets.co2.load_pandas().data

co2 = pd.read_csv('./data/co2.csv', index_col=0, parse_dates=True)

In [None]:
# co2 is a pandas.DataFrame
co2.head()  # head just prints out the first few rows

In [None]:
# The pandas DataFrame is made up of an index
co2.index

In [None]:
# and 0 or more columns (in this case just 1 - co2)
# Each column is a pandas.Series
co2['co2'].head()  


In [None]:
# label based slicing
co2.loc['1990-01-01':'1990-02-14']

In [None]:
# aggregations just like in numpy
co2.mean(axis=0)

In [None]:
# advanced grouping/resampling

# here we'll calculate the annual average time series of co2 concentrations
co2_as = co2.resample('YS').mean()  # YS is for the start of each year ('AS' is the older alias, deprecated in pandas 2.2)

co2_as.head()

In [None]:
# we can also quickly calculate the monthly climatology

co2_climatology = co2.groupby(co2.index.month).mean()
co2_climatology

In [None]:
%matplotlib inline

# and even plot that using pandas and matplotlib
co2_climatology.plot()

## Xarray

xarray is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

Xarray was inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray's data model, and integrates tightly with dask for parallel computing.

Xarray Documentation: http://xarray.pydata.org/en/stable/

![](./images/xarray.png)


In [None]:
import xarray as xr

In [None]:
# Open some sample data, this could also be accessed via: 
# ds = xr.tutorial.open_dataset('air_temperature')
ds = xr.open_dataset('./data/air_temperature.nc')

Opening a NetCDF file gives us an `xarray.Dataset`.
These have dimensions, coordinates, data variables, and attributes.

- Dimensions are the names of each axis (e.g. axis 0 is the time dimension)
- Coordinates are like Pandas Indexes (described above)
- Data variables are where the actual data in a dataset lives
- Attributes are the metadata describing the dataset.
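
The four pieces listed above can be seen by building a tiny dataset by hand. This is a minimal sketch with made-up values; the name `toy`, the `title` attribute, and the coordinate values are assumptions for illustration and do not come from the sample file.

```python
import numpy as np
import pandas as pd
import xarray as xr

times = pd.date_range("2013-01-01", periods=3, freq="D")
lats = [10.0, 20.0]

toy = xr.Dataset(
    # data variable: a name, the dimensions it spans, and the values
    data_vars={"air": (("time", "lat"), np.full((3, 2), 280.0))},
    # coordinates: labels for the positions along each dimension
    coords={"time": times, "lat": lats},
    # attributes: free-form metadata
    attrs={"title": "toy example"},
)

print(toy.sizes["time"], toy.sizes["lat"])  # 3 2
print(list(toy.data_vars))                  # ['air']
print(toy.attrs)                            # {'title': 'toy example'}
```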

In [None]:
print(ds)

Each data variable is an `xarray.DataArray`.

In [None]:
ds['air']

Xarray has a robust toolkit for analyzing your data. A few examples:

In [None]:
# take the mean along the time dimension
ds_mean = ds.mean(dim='time')
ds_mean

In [None]:
# indexing using coordinates
ds.sel(time=slice('2013-03-01', '2013-06-01'))

In [None]:
# resample the air_temperature data to monthly means
ds['air'].resample(time='1MS').mean(dim='time')

In [None]:
# make a plot of the standard deviation of air temperature
ds['air'].std(dim='time').plot()

In [None]:
# mask out where it's cold or hot
da = ds['air'].sel(time='2013-05-01T18:00:00')
da.where((da > 273) & (da < 295)).plot()

In [None]:
# calculate the mean temperature for each season using groupby
da_season = ds['air'].groupby('time.season').mean(dim='time')
da_season

In [None]:
# and plot that using xarray's facet grid feature
da_season.plot(col='season', col_wrap=2)