# W2: NumPy, pandas, xarray

- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 4 February, 2024

## Intended Learning Outcomes (ILOs)
- NumPy and Pandas: understand and apply NumPy for numerical computations, leveraging vectorization and broadcasting, along with pandas for handling and analyzing data.
- Time Series Analysis and Visualization: employ pandas for time series data analysis using datetime functionalities and visualize the results with pandas' built-in plotting tools.
- xarray: conduct data operations on multidimensional datasets using xarray, integrating it seamlessly with NumPy and pandas workflows.

## 1. NumPy and pandas (20 mins)
**NumPy:**
- NumPy (Numerical Python) is the fundamental package for scientific computing in Python: https://numpy.org/doc/stable/.
- NumPy is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems: https://numpy.org/doc/stable/user/absolute_beginners.html.
- NumPy can be used to perform a wide variety of mathematical operations on arrays.

**pandas:**
- pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive: https://pandas.pydata.org/docs/getting_started/overview.html
- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
- The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

In [None]:
# import package
import numpy as np
import pandas as pd
print(np.__version__)
print(pd.__version__)

Q1: How do you create an arithmetic array (evenly spaced values) using `np.arange()`?
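
One possible answer to Q1, as a minimal sketch: `np.arange(start, stop, step)` generates evenly spaced values and excludes `stop`. The variable names `evens` and `points` are only illustrative, and `np.linspace` is shown as an alternative for when the number of points is known instead of the step size.

```python
import numpy as np

# np.arange(start, stop, step): the stop value itself is excluded
evens = np.arange(0, 10, 2)
print(evens)  # [0 2 4 6 8]

# np.linspace(start, stop, num): the stop value is included by default
points = np.linspace(0.0, 1.0, 5)
print(points)
```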

### 1.1 reshaping, indexing, and slicing

In [None]:
arr = np.arange(24)
print(arr.shape)
arr

In [None]:
# reshape
arr = arr.reshape(6,4)
print(arr.shape)
arr

In [None]:
sliced_arr0 = arr[1,1] 
sliced_arr0

In [None]:
count_gt5 = len(arr[arr>5]) # number of elements greater than 5
count_gt5

In [None]:
sliced_arr0 = arr[1:] # Rows 1 to the end
sliced_arr0

In [None]:
sliced_arr1 = arr[1:4] # Rows 1 to 3
sliced_arr1

In [None]:
sliced_arr1_1 = arr[1:4, :] # Rows 1 to 3, All columns
sliced_arr1_1

In [None]:
sliced_arr2 = arr[1:4,2:4] # Rows 1 to 3, Columns 2 to 3
sliced_arr2

In [None]:
sliced_arr3 = arr[1:4, 2] # Rows 1 to 3, Column 2
sliced_arr3

In [None]:
sliced_arr4 = arr[1:4:2] # Rows 1 and 3 (step of 2)
sliced_arr4

### 1.2 find the indices (row and column)

In [None]:
arr

In [None]:
# find a certain value
value_to_find = 15
(row_indices, col_indices) = np.where(arr == value_to_find)
print(row_indices, col_indices)

How to find the indices of the maximal value?
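
One way to approach this question, sketched under the assumption that `arr` is still the 6x4 array from section 1.1: `np.argmax` returns a position in the flattened array, and `np.unravel_index` converts it back to a (row, column) pair.

```python
import numpy as np

arr = np.arange(24).reshape(6, 4)

# argmax returns the position of the maximum in the flattened array
flat_index = np.argmax(arr)

# unravel_index converts the flat position into (row, column)
row, col = np.unravel_index(flat_index, arr.shape)
print(row, col)  # 5 3
```

If the maximum occurs more than once, `np.argmax` reports only the first occurrence, whereas `np.where(arr == arr.max())` returns every matching position.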

### 1.3 vectorization

In [None]:
arr = np.random.rand(1000000)

In [None]:
%%time
squares_loop = [x**2 for x in arr]

In [None]:
%%time 
squares_vectorized = arr**2

### 1.4 broadcasting

In [None]:
a = np.random.rand(100000000)
b1 = 5
b2 = np.full(100000000, 5)
b2

In [None]:
%%time
c1 = a*b1
c1

In [None]:
%%time
c2 = a*b2
c2
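
The cells above multiply an array by a scalar and by an equally sized array. Broadcasting also works between arrays of different shapes, as long as each pair of axes either matches or has size 1. A small sketch, with illustrative names `col`, `row`, and `table`:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)  # shape (3, 1)
row = np.arange(4).reshape(1, 4)  # shape (1, 4)

# the size-1 axes are stretched virtually, so the result has shape (3, 4)
table = col * 10 + row
print(table.shape)  # (3, 4)
print(table)
```

No copies of `col` or `row` are materialised at full size, which is one reason the scalar version above avoids allocating an array like `b2`.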

### 1.5 create series and dataframe

In [None]:
# create a range of dates (a DatetimeIndex)
dates = pd.date_range('20240208', periods = 6)
dates

In [None]:
# create a DataFrame that uses the dates as its index
df = pd.DataFrame(np.random.randn(6,4), 
                  index = dates, 
                  columns=["a", "b", "c", "d"])
df

In [None]:
# show the index of a dataframe
df.index

In [None]:
# quickly describe the dataframe
df.describe()

In [None]:
# transposition
df.T

### 1.6 questions

Q1: How to convert 1-D array into 2-D array?
- https://numpy.org/doc/stable/user/absolute_beginners.html

Q2: How to calculate the mean squared error (MSE)?
- Mean squared error is an important metric in regression analysis: it is the average of the squared differences between predicted and true values.
```
y_pred = np.array([1.0, 2.0, 3.0, 4.0])
y_true = np.array([1.1, 1.9, 3.1, 3.9])
```
- https://numpy.org/doc/stable/user/absolute_beginners.html
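
A possible solution sketch for Q2, using the two arrays given above. The name `mse` is illustrative; the calculation is the element-wise difference, squared, then averaged.

```python
import numpy as np

y_pred = np.array([1.0, 2.0, 3.0, 4.0])
y_true = np.array([1.1, 1.9, 3.1, 3.9])

# element-wise difference, squared, then averaged over all elements
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # approximately 0.01, up to floating-point rounding
```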


Q3: How do you select the "a" column for the date "2024-02-08" in `df`?

Q4: How to get the positive elements from `df`?
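
A hedged sketch for Q3 and Q4, rebuilding `df` as in section 1.5 so the block runs on its own. It assumes label-based selection with `.loc` and a `pd.Timestamp` row label; the names `value` and `positive` are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20240208", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=["a", "b", "c", "d"])

# Q3: .loc selects by label: (row label, column label)
value = df.loc[pd.Timestamp("2024-02-08"), "a"]

# Q4: a boolean mask keeps positive entries; the others become NaN
positive = df[df > 0]
print(value)
print(positive)
```

The masked result keeps the original shape, so `positive` can still be aligned with `df` by index and column.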

## 2. Time Series Analysis and Visualization (15 mins)

### 2.1 basic datetimes

In [None]:
np.array(['2024-02-08', '2024-02-09', '2024-02-10'], dtype='datetime64')

### 2.2 parsing time series information from various sources 

In [None]:
import datetime
dti = pd.to_datetime(["02/01/2024", 
                      np.datetime64("2024-02-02"), 
                      datetime.datetime(2024, 2, 3)])
dti

In [None]:
ts = pd.Series(np.random.randn(29), index=pd.date_range("2024-02-01", "2024-02-29"))

### 2.3 questions
Q1: Please create a NumPy array that includes all the dates in February 2024.
- https://numpy.org/doc/stable/reference/arrays.datetime.html

Q2: Please create a pandas Series that includes all the dates in February 2024.
- https://pandas.pydata.org/docs/user_guide/timeseries.html

Q3: Please use `np.random.randn` to create a pandas Series (Y-axis) with the result from Q2 as the index (X-axis), and produce a line plot and a scatter plot.
- https://pandas.pydata.org/docs/user_guide/visualization.html
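
One possible way to build the data for Q1 to Q3, as a sketch. The names `feb_np`, `feb_pd`, and `ts` are illustrative, and the plotting step is left out of the code so that the block does not depend on matplotlib.

```python
import numpy as np
import pandas as pd

# Q1: NumPy array of days; the stop month ("2024-03") is excluded
feb_np = np.arange("2024-02", "2024-03", dtype="datetime64[D]")
print(feb_np.shape)  # (29,)

# Q2: pandas Series holding the same dates
feb_pd = pd.Series(pd.date_range("2024-02-01", "2024-02-29"))

# Q3 (data only): random values indexed by the February dates
ts = pd.Series(np.random.randn(len(feb_pd)), index=pd.DatetimeIndex(feb_pd))
print(ts.head())
```

With matplotlib installed, `ts.plot()` draws a line plot, and `ts.plot(style="o")` draws markers only, which gives a scatter-like view of the same Series.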

## 3. xarray (15 mins)
- xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience: https://docs.xarray.dev/en/stable/index.html
- Unlike a plain NumPy array, whose axes are identified only by position, an xarray object lets you refer to each axis by its dimension name (for example `time`, `lat`, `lon`).

In [None]:
import xarray as xr
xr.__version__

### 3.1 create a DataArray 
- data, dimensions (optional), coordinates (optional)

In [None]:
# create an xarray DataArray
data = np.random.randn(3,4,3) # create data with shape (time, lat, lon)
lat = [-20, -10, 10, 20]
lon = [10, 20, 30]
time = pd.date_range("2023-01-01", periods=3)
array = xr.DataArray(data, coords = [time, lat, lon], 
                     dims=['time', 'lat', 'lon'], 
                     name = "foo") # 3D array ('time', 'lat', and 'lon' are the dimension names)
array

### 3.2 indexing and selecting data

In [None]:
array[2:]

In [None]:
array.sel(lon=10)
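
The two cells above show positional indexing and selection by label. A few further selection patterns are sketched below on the same kind of array, rebuilt here so the block runs on its own; the names `by_position`, `by_label`, and `nearest` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.randn(3, 4, 3)
array = xr.DataArray(
    data,
    coords=[pd.date_range("2023-01-01", periods=3), [-20, -10, 10, 20], [10, 20, 30]],
    dims=["time", "lat", "lon"],
    name="foo",
)

# isel selects by integer position along a named dimension
by_position = array.isel(time=0)

# sel with a slice selects by label; both ends of the slice are included
by_label = array.sel(lat=slice(-10, 10))

# method="nearest" picks the closest coordinate value (here lon=10)
nearest = array.sel(lon=12, method="nearest")
print(by_position.shape, by_label.shape, nearest.shape)
```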

### 3.3 deal with NetCDF

In [None]:
# Export a netcdf file
array.to_netcdf('output.nc')

In [None]:
# read in a netcdf file
ds=xr.open_dataset('output.nc')
ds

### 3.4 check NetCDF basic information

In [None]:
ds.dims

In [None]:
ds.attrs

In [None]:
ds.coords

In [None]:
ds.data_vars

### 3.5 questions
Q1: Please plot `foo` with `lon` on the X-axis and `lat` on the Y-axis, where each value is the mean over `time`.
- https://docs.xarray.dev/en/stable/generated/xarray.DataArray.mean.html

Q2: Please plot `foo` with `time` on the X-axis and the mean value (over `lat` and `lon`) on the Y-axis.
- https://docs.xarray.dev/en/stable/generated/xarray.DataArray.mean.html
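
A sketch of the reductions behind Q1 and Q2, assuming an array built as in section 3.1. The names `time_mean` and `spatial_mean` are illustrative, and the plotting calls are left out of the code so that the block does not depend on matplotlib.

```python
import numpy as np
import pandas as pd
import xarray as xr

data = np.random.randn(3, 4, 3)
array = xr.DataArray(
    data,
    coords=[pd.date_range("2023-01-01", periods=3), [-20, -10, 10, 20], [10, 20, 30]],
    dims=["time", "lat", "lon"],
    name="foo",
)

# Q1: average over time, leaving a (lat, lon) map
time_mean = array.mean(dim="time")

# Q2: average over lat and lon, leaving a time series
spatial_mean = array.mean(dim=["lat", "lon"])
print(time_mean.dims, spatial_mean.dims)
```

With matplotlib installed, `time_mean.plot()` and `spatial_mean.plot()` should give a 2-D map and a line plot respectively. Note that this spatial mean is unweighted; it does not account for grid-cell area varying with latitude.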

## Project 1

Please fork the repo (https://github.com/m-edal/Earth-Env-DS-MSc-Course/tree/main), and add your project description to the `README.md` file