# Introduction to Python for scientific research
*Bart van Stratum, Max Planck Institute for Meteorology, Hamburg*

AES Python workshop, 7-8 October 2015, MPI-M

This notebook is available at https://github.com/julietbravo/python_workshop_AES

### Outline / programme:
My talk:
1. Introduction
2. Running Python (on MPI systems, and your PC)
3. ...

Chiel's talk:
1. .....

Sebastian's talk:
1. .....

The lectures are only intended to provide a very basic introduction; the focus is on the practical!

# 1 Introduction
### The origin of Python

> *"Over six years ago, in December 1989, I was looking for a "hobby" programming project that would keep me occupied during the week around Christmas...”* — Guido van Rossum

### Python: general-purpose, high-level programming language 
* General purpose: more than a plotting language!  
  * One of the key languages at e.g. Google, CERN, NASA   
* High-level: focus on code readability, short syntax  
* Interpreted language

### Why use Python for your research?
*(my personal opinion, and in random order)*

1. Easy to learn, and read / write code
2. It's free! (unlike Matlab, IDL, ...)
3. Well documented (I'll get back to that), easy to get support (Stack Overflow)
4. Capable of handling your entire scientific workflow

### Python 2 vs. Python 3
Extensive discussion at https://wiki.python.org/moin/Python2orPython3
> Python 2.x is legacy, Python 3.x is the present and future

* Python 3: released in 2008, stable and actively developed, currently at version 3.5
* Python 2: last major release (2.7) in 2010, maintained but no longer actively developed
  
Python 3 = backwards-incompatible release! Some Python 3 features backported to 2.7
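
Two of the incompatibilities that most often affect scientific scripts are the `print` function and integer division. A minimal sketch in Python 3 (the numbers are arbitrary illustrations):

```python
# Python 3: print is a function and `/` always returns a float
half = 7 / 2     # true division
floor = 7 // 2   # floor division (what `/` did for two integers in Python 2)

print(half, floor)
```

In Python 2.7, adding `from __future__ import print_function, division` at the top of a script enables the same behaviour.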

#### My recommendation: use Python 3

# 2 Getting started with Python
## 2.1 Install or load Python
Python core by default installed on (almost?) all Linux distributions and OS X. Useful extensions (discussed later) available through package managers  

##### On MPI systems (desktops, Thunder): 
* `module load python` (Python 2.7)
* `module load python3/3.4.2` (Python 3.4)

##### OS X using Macports:
* `sudo port install python34 py34-ipython py34-numpy py34-matplotlib py34-netcdf4`

##### Windows (untested):
* Enthought Canopy (https://www.enthought.com)  
  * Free version available which excludes... support for plotting maps

## 2.2 Running Python code

Two options:
1. Write code in a script (.py extension), call Python interpreter (`$ python xx.py`)
2. Run code in interactive Python shell (IPython), great for development / debugging: 
  * Allows you to run scripts, interactively type code, access variables, etc.

Example IPython (start from command line with `ipython`):

#### Script: hello_world.py  
`print('hello world!')`  
`pi = 3.1415`


In [16]:
run hello_world.py

hello world!


Directly access variable in Python interpreter:

In [17]:
3*pi

9.4245

Access the documentation of a function by appending `?`:

In [3]:
import numpy as np
np.max?

## 2.3 Python extensions
Python standard library provides basic functionality, often useful (or necessary) to extend

Useful (for us..) examples:
1. NumPy: multi-dimensional arrays + functions to manipulate them
2. matplotlib (pylab) + basemap: plotting (line, contour, maps, ..)
3. netCDF4: reading / writing NetCDF files
4. .....





#### How to use the extensions: import statement

In [18]:
import numpy # import using default name
numpy.zeros(2), numpy.ones(2)

(array([ 0.,  0.]), array([ 1.,  1.]))

In [19]:
import numpy as np # abbreviated name
np.zeros(2), np.ones(2)

(array([ 0.,  0.]), array([ 1.,  1.]))

In [20]:
reset -f 

In [21]:
from numpy import zeros # specific import
ones(2) # undefined

NameError: name 'ones' is not defined

In [22]:
from numpy import * # import everything
zeros(2), ones(2)

(array([ 0.,  0.]), array([ 1.,  1.]))

`import *` can be tricky:

In [23]:
pi = 0
from numpy import * # contains definition of pi
pi

3.141592653589793
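
A namespaced import avoids this kind of silent overwrite. The sketch below uses the standard-library `math` module as a stand-in for NumPy; the same reasoning applies to `import numpy as np`:

```python
import math

pi = 0            # your own variable
print(pi)         # your variable is untouched by the import
print(math.pi)    # the module's constant, accessed through its namespace
```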

## 2.4 Python basics
### 2.4.1 Variables

Python is dynamically typed and uses *duck typing*: the type of a variable follows from the value assigned to it.
> *"If a bird looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck."*

In [24]:
var_1 = 1
var_2 = 1.
var_3 = "1"

type(var_1), type(var_2), type(var_3)

(int, float, str)

Explicit casting:

In [25]:
var_1 = float(1)
var_2 = int(1.)
var_3 = float("1")

type(var_1), type(var_2), type(var_3)

(float, int, float)
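
Duck typing also means that a function does not check the type of its arguments; it works for any object that supports the operations used. A minimal sketch (the function `double` is a hypothetical example, not part of the workshop material):

```python
def double(x):
    # Works for every type that defines multiplication with an integer
    return x * 2

results = [double(2), double(1.5), double("ab"), double([0, 1])]
print(results)

# Explicit casting raises an error if the conversion is not possible
try:
    int("1.5")
    cast_ok = True
except ValueError:
    cast_ok = False
print(cast_ok)
```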

### 2.4.2 Data structures

#### 2.4.2.1 Lists: constructor [a,b,c]

In [26]:
time = [0,1,2,3,4,5,6,7,8,9]

Values are accessible by index or slice:  

In [27]:
time[0], time[1:4], time[0:6:2]

(0, [1, 2, 3], [0, 2, 4])

*Note that indexing starts at zero, and slicing excludes the last element*

List sizes can be changed directly:

In [28]:
time.append(10) # Append a value
time.remove(0)  # Remove first occurrence of value 0
time.pop(5)     # Remove the element at index 5
time

[1, 2, 3, 4, 5, 7, 8, 9, 10]
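
Two further properties of lists that often surprise newcomers are negative indices and the fact that assignment does not copy a list. A short sketch, using a hypothetical list `levels`:

```python
levels = [1000, 925, 850, 700]

last = levels[-1]         # negative indices count from the end
last_two = levels[-2:]    # slice with the last two elements

alias = levels            # second name for the same list, not a copy
levels_copy = levels[:]   # a full slice creates a new list
alias.append(500)         # also changes `levels`

print(last, last_two)
print(levels, levels_copy)
```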

#### 2.4.2.2 Tuples: constructor (a,b,c) 
Like a list, but not mutable

In [29]:
stations = ('hamburg','cabauw','karlsruhe')
stations

('hamburg', 'cabauw', 'karlsruhe')

After defining them, unable to change, append or remove items
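
A minimal sketch of what this looks like in practice: changing an element raises a `TypeError`, and a new tuple has to be created instead (the extra station `'lindenberg'` is only an illustration):

```python
stations = ('hamburg', 'cabauw', 'karlsruhe')

try:
    stations[0] = 'lindenberg'   # not allowed: tuples are immutable
    changed = True
except TypeError:
    changed = False

# Creating a new tuple from the old one is possible
more_stations = stations + ('lindenberg',)
print(changed, more_stations)
```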

#### 2.4.2.3 Dictionaries: constructor {'a':0, 'b':1}
Combination of keys and values; values are accessible by key:

In [30]:
data = {'T': 290, 'q': 10e-3, 'u': -2}
data['T'], data['u']

(290, -2)

Dictionaries are mutable:

In [31]:
data['T'] = 300  # Change element
data['v'] = 3    # Add element
data.pop('u')    # Remove element (or del data['u'])
data

{'T': 300, 'q': 0.01, 'v': 3}
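
Looping over a dictionary and reading keys that may be missing are common follow-up steps. A short sketch with the same kind of dictionary (the key `'w'` is deliberately absent):

```python
data = {'T': 300, 'q': 0.01, 'v': 3}

# Loop over key/value pairs
for key, value in data.items():
    print(key, value)

has_w = 'w' in data          # test whether a key exists
w = data.get('w', 0.)        # return a default instead of raising a KeyError
keys = sorted(data.keys())   # sorted list of all keys
```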

#### 2.4.2.4 Numpy arrays

In [32]:
import numpy as np
a1 = np.zeros(4)  # Array initialized with zeros
a2 = np.ones(4)   #   "         "      "   ones
a3 = np.empty(4)  # Uninitialized array

Multi-dimensional arrays:

In [33]:
a4 = np.zeros((2,2,2))
a5 = np.zeros((2,2,2), dtype=int)

More information on Numpy arrays and how to efficiently use them in Chiel's talk
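
The shape and data type of an array are available as attributes, and elements are addressed with one index per dimension. A minimal sketch (the array names are arbitrary):

```python
import numpy as np

a = np.zeros((2, 3))              # 2 rows, 3 columns, floating point by default
b = np.zeros((2, 3), dtype=int)   # same shape, integer type

a[0, 1] = 5.                      # one index per dimension
print(a.shape, a.dtype, b.dtype)
print(a[0, :])                    # first row
```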

### 2.4.3 Objects
Python is object-oriented (OO); enough material to cover a one-week workshop, so I'll only provide a single useful example

Example: reading in multiple files

In [34]:
import netCDF4 as nc4

f1     = nc4.Dataset('data/drycblles_1.nc', 'r')
time1  = f1.variables["t"][:]
ustar1 = f1.variables["ustar"][:]
obuk1  = f1.variables["obuk"][:]
# and ~60 more variables

f2     = nc4.Dataset('data/drycblles_2.nc', 'r')
time2  = f2.variables["t"][:]
ustar2 = f2.variables["ustar"][:]
# etc.

f3     = nc4.Dataset('data/drycblles_3.nc', 'r')
time3  = f3.variables["t"][:]
# etc.

Lots of duplicated code; every change (adding/removing a variable, ...) requires a code change for every file

#### Object oriented approach:

In [35]:
import netCDF4 as nc4
 
class read_nc:    # class definition
    def __init__(self, file_name):    # constructor
        f = nc4.Dataset(file_name, 'r')
        
        self.time  = f.variables["t"][:]
        self.ustar = f.variables["ustar"][:]
        self.obuk  = f.variables["obuk"][:]
        # and 100 more variables
    
    def post_process(self):
        self.time_hour = self.time / 3600.
        
r1 = read_nc('data/drycblles_1.nc')
r2 = read_nc('data/drycblles_2.nc')
r3 = read_nc('data/drycblles_3.nc')

r1.time

array([    0.,   300.,   600.,   900.,  1200.,  1500.,  1800.,  2100.,
        2400.,  2700.,  3000.,  3300.,  3600.,  3900.,  4200.,  4500.,
        4800.,  5100.])

In [36]:
r1.post_process()
r1.time_hour

array([ 0.        ,  0.08333333,  0.16666667,  0.25      ,  0.33333333,
        0.41666667,  0.5       ,  0.58333333,  0.66666667,  0.75      ,
        0.83333333,  0.91666667,  1.        ,  1.08333333,  1.16666667,
        1.25      ,  1.33333333,  1.41666667])
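
The constructor can be generalized further by passing the variable names as a list, so that adding a variable no longer requires a new line of code. A minimal sketch, assuming a plain dictionary as a stand-in for `f.variables` (the class name `ReadVariables` and the data are hypothetical):

```python
class ReadVariables:
    """Store selected variables from a dict-like source as attributes."""
    def __init__(self, source, names):
        for name in names:
            # Equivalent to: self.<name> = source[name]
            setattr(self, name, source[name])

# Hypothetical stand-in for the `variables` dictionary of a NetCDF file
fake_file = {'t': [0., 300., 600.], 'ustar': [0.1, 0.2, 0.3]}

r = ReadVariables(fake_file, ['t', 'ustar'])
print(r.t, r.ustar)
```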

### 2.4.4 Loops

Two options: `for:` and `while:`

In [44]:
for i in range(2): # equivalent to range(0,2)
    print(i)

0
1


In [5]:
i = 10
while i > 1:
    i /= 2   # i = i / 2
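
`for` loops iterate over any sequence directly, so an explicit index is often unnecessary. A short sketch with `enumerate` and `zip` (the station names and temperatures are made-up values):

```python
stations = ['hamburg', 'cabauw', 'karlsruhe']
T = [290., 291., 292.]

# Loop over the elements themselves
for station in stations:
    print(station)

# enumerate() provides the index together with the element
for i, station in enumerate(stations):
    print(i, station)

# zip() loops over several sequences in parallel
pairs = []
for station, temp in zip(stations, T):
    pairs.append((station, temp))
```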

Indentation determines body of loop:

In [66]:
for i in range(2):
    for j in range(2):
        for k in range(2):
            pass # i.e. do nothing
        # part of j-loop

Python does not prescribe the amount of indentation, as long as it is consistent within each block:

In [69]:
for i in range(1):
 pass
for i in range(1):
                   pass
for i in range(1):
         pass

for i in range(1):
             for j in range(1):
              for k in range(1):
                                 pass

> *"How To Write Unmaintainable Code and Ensure a job for life ;-)"*
https://www.thc.org/root/phun/unmaintain.html

### 2.4.5 Functions

In [1]:
cp = 1004.
Rd = 287.
def exner(p, p0):
    return (p/p0)**(Rd/cp)

exner(90000, 1e5)  

0.970331031603382

Default arguments are possible:

In [2]:
def exner(p, p0=1e5):
    return (p/p0)**(Rd/cp)

exner(90000), exner(90000,101300)

(0.970331031603382, 0.9667549928755136)
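
Arguments can also be passed by name, which makes calls with several parameters easier to read. The variant below additionally passes the constants as keyword arguments instead of relying on global variables; this is a design suggestion, not part of the original example:

```python
def exner(p, p0=1e5, Rd=287., cp=1004.):
    return (p / p0)**(Rd / cp)

a = exner(90000)                   # all defaults
b = exner(p=90000, p0=101300)      # keyword arguments
c = exner(p0=101300, p=90000)      # order is free when names are used
print(a, b, c)
```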

Easy to operate on arrays:

In [4]:
import numpy as np

def exner(p, p0=1e5):
    return (p/p0)**(Rd/cp)

p = np.linspace(70000, 101300, 10)
exner(p)

array([ 0.90306759,  0.91567175,  0.92785679,  0.93965467,  0.95109368,
        0.96219894,  0.97299292,  0.98349577,  0.99372566,  1.00369901])

# 3 Tips 'n tricks
## 3.1 Input & output
### 3.1.1 Text files  
The easy ones (structured files, only numbers):

In [8]:
import numpy as np

f = np.loadtxt('data/some_data.txt')
time = f[:,0]
T    = f[:,1]
Td   = f[:,2]

time, T, Td

(array([   0.,   30.,   60.,   90.,  120.]),
 array([  290.,   291.,   292., -9999.,   294.]),
 array([  285.,   286.,   287.,   288., -9999.]))

Structured files, but mixed data (numbers, string, ..): see `np.genfromtxt`
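
A minimal sketch of `np.genfromtxt` for a file with a text column and a numeric column. The file content is hypothetical, and `io.StringIO` stands in for a file on disk so that the example runs on its own:

```python
import io
import numpy as np

# Hypothetical file content; io.StringIO stands in for a file on disk
text = io.StringIO("station T\nhamburg 290\ncabauw 291\n")

# names=True: read the column names from the first line
# dtype=None: determine the type of each column separately
obs = np.genfromtxt(text, dtype=None, names=True, encoding='utf-8')

print(obs['station'])
print(obs['T'])
```

The result is a structured array, in which each column is accessed by its name.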

Last resort:

In [9]:
with open('data/some_data.txt', 'r') as f:  # file is closed automatically
    for line in f:
        print(line.split())

['#', 'time', 'T', 'Td']
['0', '290', '285']
['30', '291', '286']
['60', '292', '287']
['90', '-9999', '288']
['120', '294', '-9999']
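
The split strings still have to be converted to numbers. A minimal sketch of that step, using `io.StringIO` with two made-up data rows as a stand-in for the opened file:

```python
import io

# Stand-in for open('data/some_data.txt', 'r'), so the example runs without the file
f = io.StringIO("# time T Td\n0 290 285\n30 291 286\n")

time, T, Td = [], [], []
for line in f:
    if line.startswith('#'):    # skip header / comment lines
        continue
    cols = line.split()
    time.append(float(cols[0]))
    T.append(float(cols[1]))
    Td.append(float(cols[2]))

print(time, T, Td)
```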


### Intermezzo: dealing with bad/missing data

NetCDF: values equal to the `_FillValue` attribute of a variable are automatically masked when read with the `netCDF4` module

Manually masking arrays (`numpy.ma`):

In [21]:
T, Td

(array([  290.,   291.,   292., -9999.,   294.]),
 array([  285.,   286.,   287.,   288., -9999.]))

In [22]:
T_m  = np.ma.masked_where(T  == -9999, T ) 
Td_m = np.ma.masked_where(Td == -9999, Td)
T_m

masked_array(data = [290.0 291.0 292.0 -- 294.0],
             mask = [False False False  True False],
       fill_value = 1e+20)

Masked values are excluded from statistics, plots, etc.:

In [19]:
T.mean(), T_m.mean()

(-1766.4000000000001, 291.75)

Masks are propagated:

In [20]:
T_m-Td_m

masked_array(data = [5.0 5.0 5.0 -- --],
             mask = [False False False  True  True],
       fill_value = 1e+20)
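
A few other `numpy.ma` tools are often useful when working with masked data. A short sketch, using the same temperature values as above:

```python
import numpy as np

T = np.array([290., 291., 292., -9999., 294.])

T_m = np.ma.masked_equal(T, -9999)   # mask all elements equal to -9999

n_valid = T_m.count()                # number of unmasked elements
valid = T_m.compressed()             # plain array with only the valid values
T_nan = T_m.filled(np.nan)           # plain array, masked values replaced by NaN

print(n_valid, valid, T_nan)
```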