# 1. General

Python uses indentation to structure code instead of curly brackets ({, }) as in C++.

In [1]:
for i in range(0, 5):
    if i == 2:
        continue
    print(i)

0
1
3
4


Comments start with the hash sign (#) and run to the end of the line.

In [None]:
# This is a comment

# 2. Variables

Everything in Python is an instance of a class, even integers, floating-point numbers, and Booleans. Any value can be assigned to any variable; the assignment determines the type the variable currently refers to.

In [None]:
i = 54
f = 3.14
b = True

print(i, f, b, "\n")
print(type(i), type(f), type(b))

You can even assign multiple variables at once.

In [None]:
i, ff, bb = 12.23, 7.65, False

print(i, ff, bb)

print("\nvariable i is now of type", type(i))

As we have just seen, the type of a variable can change. Unlike C++, Python is not statically typed: the type belongs to the value (object) that a variable refers to, not to the variable name itself.

In [None]:
i = 12
print(i, type(i))
i = 54.2
print(i, type(i))
i = True
print(i, type(i))
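
The following minimal sketch illustrates the same point outside the notebook flow: the type travels with the object, and a name can be rebound freely. It also shows that Python does not convert incompatible types silently. The variable names are illustrative only.

```python
# The type is a property of the object, not of the name.
x = 12
first_type = type(x)          # int

x = "twelve"                  # rebinding the name to a str object is allowed
second_type = type(x)         # str

print(first_type.__name__, second_type.__name__)

# Incompatible operations raise an error instead of converting silently.
try:
    result = "12" + 3
except TypeError as err:
    result = None
    print("TypeError:", err)
```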

# 3. Python Containers

## Lists
A list is an ordered sequence of objects, which do not need to be of the same type. Lists are mutable, so list items can be changed.

In [None]:
# Named lst so that the built-in list type is not shadowed
lst = ['a', 12, 54.8, True, 76]
print(len(lst))
print("lst     :", lst)
print("lst[0]  :", lst[0])
print("lst[1:3]:", lst[1:3])
print("lst[2:] :", lst[2:])
print("lst[:3] :", lst[:3])
print(type(lst[0]))
lst[0] = 65.34
print("lst[0]  :", lst[0])
print(type(lst[0]))

## Tuples
A tuple is an ordered sequence of objects.

In [None]:
t = (9, "Hello Tuple", 978.12)
print(t)
print("t[1]:", t[1])

Tuples are immutable. Once they are constructed, they cannot be changed anymore.

In [None]:
# Uncommenting the next line raises a TypeError, because tuples do not support item assignment
#t[0] = "Change Me"
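
As a small sketch of what happens when the assignment is attempted, the error can be caught and inspected. Building a new tuple from the old items is one common workaround; the name `t_new` is only illustrative.

```python
t = (9, "Hello Tuple", 978.12)

try:
    t[0] = "Change Me"        # item assignment is not supported for tuples
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# To "change" a tuple, build a new one from the old items.
t_new = ("Change Me",) + t[1:]
print(t_new)
```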

## Sets
A set is an unordered collection of unique items. Duplicates are ignored. Indexing is not supported.

In [None]:
s = {5, 9, 3, 7}
print(s)

Sets provide set operations like intersection, union, difference, etc.

In [None]:
t = {1, 5, 9, 11}
print(t)

print("Intersection:", s & t)
print("Union:", s | t)
print("Difference:", s - t)
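
The properties mentioned above (duplicates are ignored, indexing is not supported) can be checked with a short sketch. The values are arbitrary example data.

```python
# Duplicates are dropped when the set is built.
s = {5, 9, 3, 7, 5, 9}
print(len(s))                 # 4

# Membership tests are the typical way to query a set.
print(7 in s, 8 in s)

# Sets have no positions, so indexing raises a TypeError.
try:
    s[0]
except TypeError as err:
    index_error = str(err)
    print("TypeError:", index_error)

# The method forms mirror the operators &, | and -.
other = {1, 5, 9, 11}
print(s.intersection(other) == (s & other))
```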

## Dictionaries
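A dictionary is a collection of key-value pairs. Keys must be hashable (for example numbers, strings, or tuples of these), while values can be of any type. Since Python 3.7, dictionaries preserve insertion order.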

In [None]:
# Named d so that the built-in dict type is not shadowed
d = {1: "value", "key": 2, 3: True}
print(type(d))
print(d)
print(d[1])
print(d["key"])
print(d[3])
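
A few everyday dictionary operations, sketched with made-up example data (the `capitals` dictionary is hypothetical): adding and overwriting entries, looking up a missing key safely, iterating in insertion order, and the restriction that keys must be hashable.

```python
capitals = {"Germany": "Berlin", "France": "Paris"}   # hypothetical example data

capitals["Italy"] = "Rome"            # add a new key-value pair
capitals["France"] = "Paris (FR)"     # overwrite an existing value

# get() returns a default instead of raising KeyError for a missing key.
print(capitals.get("Spain", "unknown"))

# Iteration follows insertion order (guaranteed since Python 3.7).
for country, city in capitals.items():
    print(country, "->", city)

# Keys must be hashable, so a list cannot be used as a key.
try:
    capitals[["a", "list"]] = "not allowed"
except TypeError as err:
    key_error = str(err)
    print("TypeError:", key_error)
```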

# 4. NumPy
Python containers are slow for large data sets for two main reasons: each item in a container is a full Python object, and the objects in a container can be of different types, so the type has to be checked for every item.

NumPy (short for Numerical Python) is a library that provides multidimensional arrays (ndarray). Only the array is an object; the items are plain values, not objects. All items must be of the same type, e.g. integer or floating point.

NumPy is most efficient if you perform operations on the whole array (or a large subarray) and use the NumPy functions.

In [None]:
import numpy as np
np.__version__
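
To make the "same type" rule concrete, here is a minimal sketch that builds an array from a list with mixed numeric types. NumPy converts all items to one common type and stores them in a single contiguous buffer. The variable names are illustrative.

```python
import numpy as np

mixed = [1, 2.5, 3]                  # a list may hold int and float objects
arr = np.array(mixed)                # the array stores one common type

print(arr, arr.dtype)                # all items were converted to float64
print(arr.itemsize, "bytes per item")
print(arr.nbytes, "bytes for the data buffer")
```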

## Creating arrays

Creating arrays from Python lists.

In [None]:
ai = np.array([1, 2, 5, 7, 23])
print(ai)
af = np.array([5.3, 212.2, 32.2])
print(af)

It is possible to explicitly set the data type of the array. Some common data types are: bool_, int32, int64, float32, float64, complex64, ...

In [None]:
ai = np.array([1, 2, 5, 2, 23], dtype='float32')
ai

Creating arrays with NumPy functions.

In [None]:
a0 = np.zeros(10, dtype=int)
a0

In [None]:
a1 = np.ones((3, 4), dtype=float)
a1

In [None]:
np.full((3, 4), 9.9)

In [None]:
np.arange(0, 10, 2)

In [None]:
np.linspace(0, 1, 5)

In [None]:
np.random.random((3, 3))

In [None]:
# interval [0, 10)
np.random.randint(0, 10, (3,3))

In [None]:
# mean 0, standard deviation 1
np.random.normal(0, 1, (3,3))

In [None]:
np.eye(3)

## Attributes

In [None]:
a1 = np.random.randint(10, size=6)
a2 = np.random.randint(10, size=(3, 4))
a3 = np.random.randint(10, size=(3, 4, 5))
print("---a1---\n", a1, "\n")
print("---a2---\n", a2, "\n")
print("---a3---\n", a3)

In [None]:
print("a3 ndim:", a3.ndim)
print("a3 shape:", a3.shape)
print("a3 size:", a3.size)
print("a3 type:", a3.dtype)

## Indexing

In [None]:
print(a1)

In [None]:
print(a1[0])
print(a1[4])
print(a1[-1])

In [None]:
print(a2)

In [None]:
print(a2[0, 0])
print(a2[1, 1])
print(a2[2, -1])

## Slicing
Slicing gives access to subarrays using the slice notation with colons (:):

x[start:stop:step]

If any of these are not specified, they default to start=0, stop=size of dimension, step=1.

In [None]:
a1 = np.random.randint(10, size=10)
print(a1)
print(a1[:5])   # first five elements
print(a1[3:])   # elements from index 3 onward
print(a1[3:8])  # subarray
print(a1[::2])  # every other element

In [None]:
print(a1[::-1])  # the defaults for start and stop are reversed for negative step values
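
Because the arrays above hold random numbers, the effect of start, stop, and step is easier to verify on fixed values. The sketch below uses `np.arange(10)` so that each value equals its index.

```python
import numpy as np

x = np.arange(10)        # fixed values 0..9 so the results are easy to check

print(x[2:8:3])          # start=2, stop=8 (exclusive), step=3 -> [2 5]
print(x[:4])             # start defaults to 0 -> [0 1 2 3]
print(x[7:])             # stop defaults to the end -> [7 8 9]
print(x[::-1][:3])       # negative step walks backwards -> [9 8 7]
print(x[8:2:-2])         # from index 8 down to (not including) 2 -> [8 6 4]
```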

In [None]:
print(a2, "\n")
print(a2[:2, :3]) # first two rows, first three columns

Multidimensional subarrays with a step (all rows, every other column):

In [None]:
print(a2[:, ::2])

Combining indexing and slicing to access single columns and rows:

In [None]:
print(a2[:, 0])

In [None]:
print(a2[2, :])

In [None]:
print(a2[2]) # a single index selects along the first dimension (row), same as a2[2, :]

**Array slices are views on the array data rather than copies!!!**

In [None]:
print(a2)

In [None]:
a2_sub = a2[:2, :2]
print(a2_sub)

In [None]:
a2_sub[0, 0] = 9
print(a2_sub, "\n")
print(a2)

To copy the data within an array or subarray, use the copy() method.

In [None]:
a2_sub_copy = a2[:2, :2].copy()
print(a2_sub_copy)

In [None]:
a2_sub_copy[0, 0] = 3
print(a2_sub_copy, "\n")
print(a2)
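
Whether two arrays refer to the same underlying data can also be checked directly with `np.shares_memory`. The following sketch uses a small fixed array (`grid` is a made-up name) so the effect of writing through a view is reproducible.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)      # hypothetical example array

view = grid[:2, :2]                     # slicing returns a view
copy = grid[:2, :2].copy()              # copy() allocates new memory

print(np.shares_memory(grid, view))     # True
print(np.shares_memory(grid, copy))     # False

view[0, 0] = 99                         # writes through to grid
copy[1, 1] = -1                         # leaves grid untouched
print(grid)
```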

Further basic manipulations for NumPy arrays are reshaping, joining, and splitting arrays.
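
As a brief sketch of these three manipulations, assuming the standard functions `reshape`, `np.concatenate`, and `np.split` are what is meant here:

```python
import numpy as np

a = np.arange(6)                           # [0 1 2 3 4 5]

# Reshaping: same data, new shape (the sizes must match: 2 * 3 == 6).
m = a.reshape(2, 3)
print(m)

# Joining: concatenate along an existing axis.
stacked = np.concatenate([m, m], axis=0)   # shape (4, 3)
print(stacked.shape)

# Splitting: cut an array into equally sized parts.
left, right = np.split(a, 2)               # [0 1 2] and [3 4 5]
print(left, right)
```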

## Universal Functions

Computing the reciprocals of all elements in an array with a traditional Python loop.

In [None]:
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

Determine how long the function takes for an array of size 1,000,000.

In [None]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

Now with the NumPy universal function divide.

In [None]:
np.divide(1.0, values)

Determine how long the function takes for the big array using the universal function.

In [None]:
%timeit np.divide(1.0, big_array)
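
`%timeit` is an IPython magic and only works in a notebook or IPython shell. Outside of these, a similar comparison can be sketched with the standard `timeit` module. The array size and repeat count below are arbitrary choices to keep the run short, and the function name `loop_reciprocals` is illustrative; the measured times depend on the machine.

```python
import timeit
import numpy as np

sample = np.random.randint(1, 100, size=100_000)

def loop_reciprocals(values):
    # Same element-by-element loop as above, repeated here to be self-contained.
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

loop_time = timeit.timeit(lambda: loop_reciprocals(sample), number=3)
ufunc_time = timeit.timeit(lambda: np.divide(1.0, sample), number=3)

print(f"loop : {loop_time:.4f} s for 3 runs")
print(f"ufunc: {ufunc_time:.4f} s for 3 runs")

# Both approaches compute the same values.
same_result = np.allclose(loop_reciprocals(sample), np.divide(1.0, sample))
print(same_result)
```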

**In conclusion, built-in Python loops that repeat small operations on individual items of a container are very slow!**

<font color='red'>Prefer the optimized universal functions for computations on arrays!</font> They perform a vectorized operation that is applied to each element of the array. The loop is executed not by the Python interpreter but inside the NumPy library, which is implemented in C.

The main purpose of universal functions is to quickly execute common operations on values in NumPy arrays. They are either executed on the whole array or a slice of the array (view). There are unary as well as binary functions that take one or two arrays as input.

NumPy makes use of overloading the native arithmetic operators, so the syntax is very natural to use.

In [None]:
1.0 / values

In [None]:
2.0 * values

Arithmetic operators implemented in NumPy are addition (add), subtraction (subtract), negation (negative), multiplication (multiply), division (divide), floor division (floor_divide), exponentiation (power), and modulus (mod).

In [None]:
print(values)
values ** 2
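
The correspondence between the operators and the named universal functions can be verified with a short sketch. The array `x` is arbitrary example data.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

# Each operator gives the same result as the named universal function.
checks = {
    "add":          np.array_equal(x + 2,  np.add(x, 2)),
    "subtract":     np.array_equal(x - 2,  np.subtract(x, 2)),
    "negative":     np.array_equal(-x,     np.negative(x)),
    "multiply":     np.array_equal(x * 2,  np.multiply(x, 2)),
    "divide":       np.array_equal(x / 2,  np.divide(x, 2)),
    "floor_divide": np.array_equal(x // 2, np.floor_divide(x, 2)),
    "power":        np.array_equal(x ** 2, np.power(x, 2)),
    "mod":          np.array_equal(x % 2,  np.mod(x, 2)),
}

for name, equal in checks.items():
    print(f"{name:12s} {equal}")
```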

Further universal functions include the absolute value, trigonometric functions, exponentials, and logarithms. NumPy also offers aggregate functions (min, max, sum, prod, var, median, ...) and many more. If you can think of a numerical operation, NumPy probably provides a function for it.
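
A few of these functions in action, as a minimal sketch on fixed example data. Element-wise functions return an array of the same shape, while aggregates reduce the array to a single value or to one value per row or column.

```python
import numpy as np

x = np.array([-2.0, 0.0, 1.0, 4.0])

print(np.abs(x))                 # element-wise absolute value
print(np.exp(x))                 # element-wise exponential
print(np.sin(x))                 # element-wise sine

# Aggregates reduce the array to a single value.
print(x.min(), x.max(), x.sum())
print(np.mean(x), np.median(x), np.var(x))

# Along an axis, an aggregate returns one value per row or column.
m = np.arange(6).reshape(2, 3)
print(m.sum(axis=0))             # column sums -> [3 5 7]
print(m.sum(axis=1))             # row sums    -> [ 3 12]
```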

# 5. Pandas

Pandas is built on top of NumPy and provides a higher-level view of data. It offers Series and DataFrame (as well as Index) objects that allow attaching row and column labels to multidimensional arrays, working with missing data, and performing powerful data operations familiar from databases and spreadsheets.

If you use the Anaconda Python stack, pandas is already installed.

In [None]:
import pandas as pd
pd.__version__

## Series

A Series object is a one-dimensional array of indexed data.

It provides both a sequence of values (a NumPy array) and a sequence of indices.

In [None]:
s = pd.Series([64.2, 274.3, 93.21, 52.87])
print(s)

In [None]:
print(s.values)
print(type(s.values))

In [None]:
print(s.index)
print(type(s.index))

In [None]:
s[2]

In [None]:
s[0:2]

One way to think of a Series object is as a one-dimensional NumPy array with an explicitly defined index. This opens up further ways of using the index.

In [None]:
s = pd.Series([9.13, 5.89, 7.37, 1.93], index=['a', 'b', 'c', 'd'])
s

In [None]:
print(s['a'])
print(s['c'])

In [None]:
s['b'] = 8.98
print(s)

In [None]:
print(s['a':'c'])
print('\n', s[0:2])

Another way to think of a Series object is as a specialization of a dictionary.

In [None]:
pop_dict = {'Berlin': 3613495, 'Munich': 1456039, 'Cologne': 1080394, 'Hamburg': 1834823, 'Frankfurt a.M.': 746878 }
pop = pd.Series(pop_dict)
pop

In [None]:
pop['Hamburg']

In [None]:
pop['Munich':'Hamburg']
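
The two views of a Series (dictionary and array) can be combined in one short sketch. The `prices` data below is made up for illustration.

```python
import pandas as pd

prices = pd.Series({"apple": 1.2, "pear": 0.8, "plum": 2.5})   # hypothetical data

print("pear" in prices)            # key lookup like a dictionary
print(list(prices.keys()))         # keys() returns the index
print(prices * 2)                  # arithmetic applies to every value, like a NumPy array
print(prices["apple":"pear"])      # label slices include the end label
```

Note that, unlike slicing with integer positions, a slice over labels includes its end point.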

## DataFrame

A DataFrame is a two-dimensional array with both flexible row indices and flexible column names. It can be thought of as a sequence of aligned (sharing the same index) Series objects.

In [None]:
area_dict = {'Berlin': 891.68, 'Munich': 310.70, 'Cologne': 405.02, 'Hamburg': 755.22, 'Frankfurt a.M.': 248.31 }
area = pd.Series(area_dict)
area

In [None]:
cities = pd.DataFrame({'population':pop, 'area in m²':area})
cities

In [None]:
cities.index

In [None]:
cities.columns

A DataFrame can also be regarded as a specialized dictionary that maps a key (column name) to a value (Series).

In [None]:
print(cities['area in m²'])
print('\n', type(cities['area in m²']))

**Careful: for a two-dimensional NumPy array called data, data[0] returns the first row. For a DataFrame object called cities, indexing with a column name, e.g. cities['population'], returns that column as a Series.**

In [None]:
a = np.random.randint(0, 10, (3,3))
print(a, '\n')
print(a[0])
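
The difference can be seen side by side in a small sketch with fixed values. The column names `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(data, columns=["x", "y"])    # hypothetical column names

print(data[0])            # NumPy: first row -> [1 2]
print(frame["x"])         # pandas: column 'x' as a Series
print(frame.iloc[0])      # pandas: first row, selected by position
```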

In [None]:
vehicle_dict = {'Berlin':'B', 'Munich':'M', 'Frankfurt a.M.': 'F', 'Bremen':'HB'}
veh = pd.Series(vehicle_dict)
cities['vehicle number'] = veh
cities
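
When a Series is assigned as a new column, pandas aligns it on the row labels. Labels missing from the Series produce NaN, and labels that do not exist in the DataFrame are dropped. A reduced sketch with two cities shows both effects.

```python
import pandas as pd

frame = pd.DataFrame({"population": [3613495, 1456039]},
                     index=["Berlin", "Munich"])
codes = pd.Series({"Berlin": "B", "Bremen": "HB"})

frame["vehicle number"] = codes         # aligned on the row labels
print(frame)

print(frame["vehicle number"].isna())   # True where no label matched
```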

## Data Indexing and Selection

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(data)

In [None]:
print(data[1])

In [None]:
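print(s, '\n')
# s has string labels, so the integer 1 is used as a position here;
# recent pandas versions deprecate this fallback in favor of s.iloc[1]
print(s[1])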

Using an integer that is not in the explicit index raises a KeyError:

In [None]:
#print(data[0])

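Slicing with integers works, but perhaps not as expected: it uses the implicit positions rather than the explicit labels, so data[1:3] returns the items at positions 1 and 2.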

In [None]:
print(data[1:3])

In [None]:
cities

In [None]:
cities['population']

In [None]:
cities_pop = cities.population

Insert a new column as the result of a computation with a universal function.

In [None]:
cities['density'] = cities['population'] / cities['area in m²']
cities

In [None]:
cities.values

In [None]:
cities.T

Accessing a specific row:

In [None]:
cities.values[1]

Accessing a specific column:

In [None]:
cities['population']

### loc and iloc indexers for Series objects

The loc attribute allows indexing and slicing that always references the explicit index.

In [None]:
print(data)

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

The iloc attribute allows indexing and slicing that always references the implicit Python-style index.

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

The explicit indexers loc and iloc make code cleaner and are recommended, especially when the explicit index consists of integers.
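
The contrast between the two indexers is easiest to see when the same numbers are used with each of them. The sketch below rebuilds the small `data` Series so that it runs on its own.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

print(data.loc[1])                 # label 1    -> 'a'
print(data.iloc[1])                # position 1 -> 'b'
print(data.loc[1:3].tolist())      # labels 1..3, end included    -> ['a', 'b']
print(data.iloc[1:3].tolist())     # positions 1..2, end excluded -> ['b', 'c']
```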

### loc and iloc for DataFrame objects

In [None]:
cities

The iloc indexer allows indexing the underlying array as if it were a simple NumPy array.

In [None]:
cities.iloc[1:4, :2]

Using the explicit index of the DataFrame with loc.

In [None]:
cities.loc[:'Cologne', 'vehicle number':'density']

### What else is there?

Combine masking and fancy indexing:

In [None]:
cities.loc[cities.density > 100, ['population', 'density']]

In [None]:
cities.density > 100
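
A comparison such as the one above yields a boolean Series (the mask) with one entry per row. Several conditions can be combined with `&` and `|`, with each condition wrapped in parentheses. The sketch below uses made-up numbers so the selection is easy to follow.

```python
import pandas as pd

frame = pd.DataFrame({"population": [100, 2000, 30000],
                      "area": [10.0, 10.0, 10.0]},
                     index=["A", "B", "C"])          # hypothetical data
frame["density"] = frame["population"] / frame["area"]

mask = frame["density"] > 100        # boolean Series, one entry per row
print(mask)

# Rows where both conditions hold, restricted to two columns.
selected = frame.loc[mask & (frame["population"] < 10000),
                     ["population", "density"]]
print(selected)
```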