# W1: Review of Python: NumPy, Pandas, and Basic Operation of Environmental Datasets

- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 29 January, 2024

## Intended Learning Outcomes (ILOs)
- NumPy Array Proficiency: learn to create and manipulate NumPy arrays, understanding both one-dimensional and multi-dimensional array operations.
- Data Handling with Pandas: gain skills in importing, exporting, and manipulating data using Pandas DataFrames.
- Environmental Data Analysis: apply NumPy and Pandas skills to analyze environmental datasets and use these analyses in a project context.

## 1. NumPy (15 mins)
- NumPy (Numerical Python) is the fundamental package for scientific computing in Python: https://numpy.org/doc/stable/.

- NumPy is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems: https://numpy.org/doc/stable/user/absolute_beginners.html.

- NumPy can be used to perform a wide variety of mathematical operations on **arrays**.

In [None]:
# import package
import numpy as np

In [None]:
# check numpy version
np.__version__

### 1.1 Ways to create a NumPy 1-D array

In [None]:
# way1
a = np.array([1, 2, 3])
a

In [None]:
# way2
b = np.zeros(2)
b

In [None]:
# way3
c = np.ones(2)
c

In [None]:
# way4
d = np.arange(4)
d

In [None]:
# specify the dtype (float, int, etc.) with the dtype argument
# more details on np.arange(): https://numpy.org/doc/stable/reference/generated/numpy.arange.html#numpy-arange
d_f = np.arange(4, dtype=float)
d_f

In [None]:
# way5
e = np.arange(2, 9, 2)
e

In [None]:
# way6
f = np.linspace(0, 10, num=5)
f
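`np.arange` and `np.linspace` both produce evenly spaced values, but they are controlled differently. The short comparison below is a minimal sketch; the variable names `by_step` and `by_count` are chosen here for illustration only.

```python
import numpy as np

# np.arange takes a step size; the stop value is excluded
by_step = np.arange(0, 10, 2.5)

# np.linspace takes the number of points; the stop value is included by default
by_count = np.linspace(0, 10, num=5)

print(by_step)
print(by_count)
```

As a rule of thumb, `np.arange` suits cases where the spacing is known, and `np.linspace` suits cases where the number of points is known.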

In [None]:
# way7: create an array from existing data
a0 = np.array([1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
a1 = a0[3:8]
a1

### 1.2 2-D or Multi-D array

In [None]:
a1D = np.array([1, 2, 3, 4])
a2D = np.array([[1, 2], [3, 4]])
a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

a3D
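To see how a multi-dimensional array is organised, it can help to inspect its attributes. The sketch below rebuilds `a3D` so that it runs on its own; `a2x4` is a name introduced here for illustration.

```python
import numpy as np

a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(a3D.ndim)   # number of dimensions (axes)
print(a3D.shape)  # length of each axis
print(a3D.size)   # total number of elements

# reshape arranges the same 8 elements as 2 rows by 4 columns
a2x4 = a3D.reshape(2, 4)
print(a2x4)
```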

### 1.3 Basic array operations

In [None]:
d1 = np.array([1, 2])
d2 = np.ones(2, dtype=int)

In [None]:
# addition
d1 + d2

In [None]:
# subtraction
d1 - d2

In [None]:
# broadcasting: the scalar 1.6 is applied to every element
d1 * 1.6
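Broadcasting also works between arrays of different shapes, as long as the shapes are compatible. The following is a minimal sketch with made-up values; `temps_c` and `offsets` are hypothetical names, not part of any dataset used in this unit.

```python
import numpy as np

temps_c = np.array([[10.0, 12.0, 14.0],
                    [20.0, 22.0, 24.0]])   # shape (2, 3), hypothetical values
offsets = np.array([0.0, 1.0, 2.0])        # shape (3,)

# the scalar 273.15 is stretched to match every element
temps_k = temps_c + 273.15

# the 1-D array is stretched across both rows
shifted = temps_c - offsets

print(temps_k)
print(shifted)
```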

In [None]:
# sum
d1.sum()

In [None]:
# max
d1.max()

In [None]:
# min
d1.min()

Note:
- the product operator `*` operates elementwise in NumPy arrays
- the matrix product can be performed using the `@` operator (in Python >=3.5) or the `dot` function or method

In [None]:
a = np.array([[1, 0],
              [0, 1]])
b = np.array([[4, 1],
              [2, 2]])

In [None]:
a * b

In [None]:
np.multiply(a, b)

In [None]:
a @ b

In [None]:
np.matmul(a, b)

In [None]:
a.dot(b)

In [None]:
np.dot(a, b)
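The cells above fall into two groups: `*` and `np.multiply` are elementwise, while `@`, `np.matmul`, and `dot` compute the matrix product for 2-D arrays. The sketch below checks this with `np.array_equal`; the names `elementwise`, `matrix`, `same_elementwise`, and `same_matrix` are introduced here for illustration.

```python
import numpy as np

a = np.array([[1, 0],
              [0, 1]])
b = np.array([[4, 1],
              [2, 2]])

elementwise = a * b   # multiplies matching positions
matrix = a @ b        # rows of a times columns of b

print(elementwise)    # off-diagonal entries become 0
print(matrix)         # a is the identity matrix, so the result equals b

# confirm that the alternative spellings give the same results
same_elementwise = np.array_equal(elementwise, np.multiply(a, b))
same_matrix = (np.array_equal(matrix, np.matmul(a, b))
               and np.array_equal(matrix, np.dot(a, b)))
print(same_elementwise, same_matrix)
```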

## 2. Pandas (15 mins)
- pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive: https://pandas.pydata.org/docs/getting_started/overview.html
- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a `DataFrame`.
- The two primary data structures of pandas, `Series` (1-dimensional) and `DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. 

In [None]:
import pandas as pd

In [None]:
# check pandas version
pd.__version__

### 2.1 importing and exporting data

Here, let's learn how to create a DataFrame.

In [None]:
# create a DataFrame from a dictionary
df0 = pd.DataFrame(
    {"A": 1.0,
     "B": pd.Timestamp("20240128"),
     "C": pd.Series(1, index=list(range(4)), dtype = float),
     "D": pd.Categorical(["test", "train","foo","test"])
     }
)
df0
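In the dictionary-style construction, scalar values are repeated to fill every row, and each column keeps its own data type. The sketch below rebuilds `df0` so that it runs on its own and prints the resulting shape and dtypes.

```python
import pandas as pd

df0 = pd.DataFrame(
    {"A": 1.0,                                   # scalar, repeated for every row
     "B": pd.Timestamp("20240128"),              # scalar timestamp, also repeated
     "C": pd.Series(1, index=list(range(4)), dtype=float),  # sets the 4-row index
     "D": pd.Categorical(["test", "train", "foo", "test"])
     }
)

print(df0.shape)
print(df0.dtypes)   # one dtype per column
```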

In [None]:
# create a dataframe from a numpy array
a = np.array([[-2.58289208,  0.43014843, -1.24082018, 1.59572603],
              [ 0.99027828, 1.17150989,  0.94125714, -0.14692469],
              [ 0.76989341,  0.81299683, -0.95068423, 0.11769564],
              [ 0.20484034,  0.34784527,  1.96979195, 0.51992837]])

df = pd.DataFrame(a)
df.head()

Export a DataFrame to a CSV file. Why `index=False`? See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [None]:
df0.to_csv("foo.csv", index=False)
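One way to explore the effect of `index=False` is to write the same DataFrame both ways and read the results back. The sketch below uses a hypothetical toy DataFrame `df_demo` and keeps the CSV text in memory with `io.StringIO` instead of writing files.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2], "y": [3, 4]})  # hypothetical toy data

# with no path given, to_csv returns the CSV text as a string
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)

# read the text back: the saved index returns as an extra "Unnamed: 0" column
back_with = pd.read_csv(io.StringIO(with_index))
back_without = pd.read_csv(io.StringIO(without_index))

print(back_with.columns.tolist())
print(back_without.columns.tolist())
```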

Read the CSV file that we exported.

In [None]:
# read a CSV file
# upload the CSV file first if it is not already in the working directory
df1 = pd.read_csv('foo.csv')

# if you want to specify the path
# path = 'XXX/XXX/XXX/XXX'
# df1 = pd.read_csv(path + '/foo.csv')

In [None]:
df1

### 2.2 Manipulating DataFrames

In [None]:
# name the dataframe column
name = ['first_column', 'second_column', 'third_column', 'fourth_column']
df2 = pd.DataFrame(a, columns = name)
df2.head()

In [None]:
# add a new column to an existing dataframe
df2['fifth_column'] = ['Hi', 'Hello', 'bonjour', 'nihao']
df2

In [None]:
# select a specific column, where each column is a series
df2_first_column = df2['first_column']
df2_first_column.head() # df2_first_column is a pandas series

In [None]:
# select more than one column
df2_multi_column = df2[['first_column', 'third_column']]

df2_multi_column.head() # df2_multi_column is a dataframe

In [None]:
# know the shape of a dataframe
df2.shape

In [None]:
# filter rows 
above_1 = df2[df2['second_column']>1] # select rows whose 'second_column' value>1
above_1
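Filters can also combine several conditions, and `.loc` can filter rows and select columns in one step. The sketch below rebuilds `df2` from rounded copies of the values used above so that it runs on its own; `both` and `subset` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# rounded copies of the values used earlier in this section
a = np.array([[-2.58,  0.43, -1.24,  1.60],
              [ 0.99,  1.17,  0.94, -0.15],
              [ 0.77,  0.81, -0.95,  0.12],
              [ 0.20,  0.35,  1.97,  0.52]])
name = ['first_column', 'second_column', 'third_column', 'fourth_column']
df2 = pd.DataFrame(a, columns=name)

# combine conditions with & (and) or | (or); each condition needs its own parentheses
both = df2[(df2['first_column'] > 0) & (df2['third_column'] > 0)]

# .loc filters rows and selects columns in one step
subset = df2.loc[df2['second_column'] > 0.5, ['first_column', 'second_column']]

print(both)
print(subset)
```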

Optional: working with an xlsx file

In [None]:
# read an xlsx file
y = pd.read_excel('sample.xlsx', sheet_name='sheet1')

# export a DataFrame to an xlsx file
df.to_excel('pd.xlsx', sheet_name='export')

## 3. Basic Operation of Environmental Datasets (15 mins)

In [None]:
!wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip

In [None]:
!unzip jena_climate_2009_2016.csv.zip

In [None]:
df = pd.read_csv("jena_climate_2009_2016.csv")
df

You can then follow the TensorFlow time series tutorial to analyze the data: https://www.tensorflow.org/tutorials/structured_data/time_series
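As a first step before the tutorial, a time-stamped table is usually easier to work with once the timestamps are parsed and used as the index. The sketch below uses a tiny made-up stand-in rather than the real file; it assumes the column names `Date Time` and `T (degC)` and a day-first timestamp format, so check these against `df.columns` before reusing it.

```python
import pandas as pd

# tiny stand-in for the real file; the values are made up for illustration
sample = pd.DataFrame({
    "Date Time": ["01.01.2009 00:10:00", "01.01.2009 00:20:00",
                  "02.01.2009 00:10:00", "02.01.2009 00:20:00"],
    "T (degC)": [-8.0, -8.4, -4.0, -5.0],
})

# parse the day-first timestamps and use them as the index
sample["Date Time"] = pd.to_datetime(sample["Date Time"], format="%d.%m.%Y %H:%M:%S")
sample = sample.set_index("Date Time")

# average the temperature over each calendar day
daily_mean = sample["T (degC)"].resample("D").mean()
print(daily_mean)
```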

## 4. Data for Project 1 (15 mins)

Please download the data here: https://www.dropbox.com/scl/fi/azzx0olpeyx45rixlsgdn/project_1.csv?rlkey=b4fj8cnmc4ytyezppfbhpky3t&dl=0

The definitions of **some** variables are available here: https://www.cesm.ucar.edu/community-projects/lens/data-sets

In [None]:
df = pd.read_csv("~/Downloads/project_1.csv") # You may change the path
df
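A quick inspection often helps when deciding on a research question. The sketch below runs on a small stand-in DataFrame; the column names `var_a` and `var_b` and all values are hypothetical and do not come from `project_1.csv`. The same calls can be applied to `df` once it is loaded.

```python
import numpy as np
import pandas as pd

# stand-in for project_1.csv; column names and values are hypothetical
df_demo = pd.DataFrame({
    "var_a": [1.0, 2.0, np.nan, 4.0],
    "var_b": [10.0, 20.0, 30.0, 40.0],
})

print(df_demo.shape)             # (rows, columns)
print(df_demo.dtypes)            # data type of each column

missing = df_demo.isna().sum()   # number of missing values per column
print(missing)

summary = df_demo.describe()     # count, mean, std, min, quartiles, max
print(summary)
```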

## 5. Homework

- sign up for the GitHub [Student Developer Pack](https://education.github.com/pack)
- think about the (research) question for project 1