# 1. General

Python uses indentation to structure code, whereas C++ uses curly braces ({ and }).

In [1]:
for i in range(0, 5):
    if i == 2:
        continue
    print(i)

0
1
3
4


Comments start with the hash sign (#) and run to the end of the line.

In [3]:
# This is a comment

# 2. Variables

Everything in Python is an instance of a class, even integers, floating point numbers, and Booleans. You can assign any value to any variable; the assignment determines the type of the variable.

In [2]:
i = 54
f = 3.14
b = True

print(i, f, b, "\n")
print(type(i), type(f), type(b))

54 3.14 True 

<class 'int'> <class 'float'> <class 'bool'>


You can even assign multiple variables at once.

In [8]:
i, ff, bb = 12.23, 7.65, False

print(i, ff, bb)

print("\nvariable i is now of type", type(i))

12.23 7.65 False

variable i is now of type <class 'float'>


As we have just seen, the type of a variable can change. Unlike C++, Python is dynamically typed rather than statically typed: the type belongs to the object, not to the variable name, so a name can be rebound to an object of a different type at any time.

In [9]:
i = 12
print(i, type(i))
i = 54.2
print(i, type(i))
i = True
print(i, type(i))

12 <class 'int'>
54.2 <class 'float'>
True <class 'bool'>
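
The following minimal sketch illustrates the same point from a different angle: the type is a property of the object a name currently refers to. The name `x` and the helper variables are only illustrative.

```python
# The type belongs to the object; the name is only a reference to it.
x = 12
type_before = type(x)   # the int class

x = "twelve"            # rebind the same name to a str object
type_after = type(x)    # the str class

print(type_before.__name__, type_after.__name__)

# isinstance() checks the type of the object the name refers to right now.
print(isinstance(x, str))
```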


# 3. Python Containers

## Lists
A list is an ordered sequence of objects, which do not need to be of the same type. Lists are mutable, so list items can be changed.

In [10]:
list = ['a', 12, 54.8, True, 76]
print(len(list))
print("list     :", list)
print("list[0]  :", list[0])
print("list[1:3]:", list[1:3])
print("list[2:] :", list[2:])
print("list[:3] :", list[:3])
print(type(list[0]))
list[0] = 65.34
print("list[0]  :", list[0])
print(type(list[0]))

5
list     : ['a', 12, 54.8, True, 76]
list[0]  : a
list[1:3]: [12, 54.8]
list[2:] : [54.8, True, 76]
list[:3] : ['a', 12, 54.8]
<class 'str'>
list[0]  : 65.34
<class 'float'>


## Tuples
A tuple is an ordered sequence of objects.

In [11]:
t = (9, "Hello Tuple", 978.12)
print(t)  #(9,"Hello Tuple", 978.12)
print("t[1]:", t[1])

(9, 'Hello Tuple', 978.12)
t[1]: Hello Tuple


Tuples are immutable. Once they are constructed, they cannot be changed anymore.

In [12]:
t[0] = "Change Me"  # raises a TypeError because tuples are immutable

TypeError: 'tuple' object does not support item assignment
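
As a small sketch of how to work with this restriction, the error can be caught and a modified version built as a new tuple. The names `point` and `new_point` are made up for this example.

```python
point = (9, "Hello Tuple", 978.12)

# Item assignment on a tuple raises TypeError.
try:
    point[0] = "Change Me"
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# A "changed" tuple has to be built as a new object.
new_point = ("Change Me",) + point[1:]
print(new_point)
```

The original tuple `point` is left untouched; only a new tuple is created.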

## Sets
A set is an unordered collection of unique items. Duplicates are ignored. Indexing is not supported.

In [18]:
s = {5, 9, 4, 3, 3}
print(s)

{9, 3, 4, 5}


Sets provide set operations like intersection, union, difference, etc.

In [23]:
t = {1, 5, 9, 11}
print(t)

print("Intersection:", s & t)
print("Union:", s | t)
print("Difference:", s - t)
print(s & t)
print(s | t)
print(t - s)

{1, 11, 5, 9}
Intersection: {9, 5}
Union: {1, 3, 4, 5, 9, 11}
Difference: {3, 4}
{9, 5}
{1, 3, 4, 5, 9, 11}
{1, 11}


## Dictionaries
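A dictionary is a collection of key-value pairs. Since Python 3.7, dictionaries preserve insertion order. Values can be of any type; keys can be of any hashable type (e.g. numbers, strings, tuples).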

In [27]:
dict = {1:"value", "key":2, 3:True}
print(type(dict))
print(dict)
print(dict[1])
print(dict["key"])
print(dict[3])

<class 'dict'>
{1: 'value', 'key': 2, 3: True}
value
2
True
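
A brief sketch of the key rules, assuming Python 3.7 or later for the ordering guarantee. The variable names `d`, `missing`, and `keys_in_order` are illustrative.

```python
d = {1: "value", "key": 2, 3: True}

# Hashable objects such as tuples can serve as keys ...
d[(0, 1)] = "tuple key"

# ... but mutable containers such as lists cannot.
try:
    d[[0, 1]] = "list key"
except TypeError as err:
    print("TypeError:", err)

# get() returns a default instead of raising KeyError for a missing key.
missing = d.get("no such key", "default")

# Iterating over a dict yields its keys in insertion order.
keys_in_order = list(d)
print(keys_in_order, missing)
```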


# 4. NumPy (pronounced "num-pie")
Python containers are slow for large data sets for two main reasons: each item in a container is a full Python object, and the items in a container can be of different types, so the type has to be checked for every item.

NumPy (short for Numerical Python) is a library that provides multidimensional arrays (ndarray). Only the array itself is an object; the items are stored as plain values rather than objects. All items must be of the same type, e.g. integer or floating point.

NumPy is most efficient when you operate on the whole array (or a large subarray) using NumPy functions.

In [31]:
import numpy as np
np.__version__

'1.16.2'

## Creating arrays

Creating arrays from Python lists.


In [35]:
ai = np.array([1, 2, 5, 7, 23])
print(ai)
af = np.array([5.3, 212.2, 32.2])
print(af)

[ 1  2  5  7 23]
[  5.3 212.2  32.2]


It is possible to explicitly set the data type of the array. Some common data types are: bool_, int32, int64, float32, float64, complex64, ...

In [36]:
ai = np.array([1, 2, 5, 2, 23], dtype='float32')
ai

array([ 1.,  2.,  5.,  2., 23.], dtype=float32)
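
The sketch below shows how the data type affects storage and how to convert between types. The attribute `itemsize` and the method `astype` are not discussed in the text above; the array names are illustrative.

```python
import numpy as np

a_int = np.array([1, 2, 5, 2, 23], dtype="int32")
a_float = a_int.astype("float64")   # astype returns a converted copy

# itemsize is the number of bytes used per element
print(a_int.dtype, a_int.itemsize)
print(a_float.dtype, a_float.itemsize)

# Mixed int/float input is promoted to a common floating point type.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```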

Creating arrays with NumPy functions.

In [37]:
a0 = np.zeros(10, dtype=int)
a0

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [38]:
a1 = np.ones((3, 4), dtype=float)
a1

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [39]:
np.full((3, 4), 9.9)

array([[9.9, 9.9, 9.9, 9.9],
       [9.9, 9.9, 9.9, 9.9],
       [9.9, 9.9, 9.9, 9.9]])

In [41]:
np.arange(0, 10, 3)  # np.arange(start, stop, step): values from start up to (but excluding) stop, in increments of step

array([0, 3, 6, 9])

In [42]:
np.linspace(0, 1, 5)  # np.linspace(start, stop, num): num evenly spaced values from start to stop (inclusive)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])
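
To make the difference between the two functions concrete, here is a small side-by-side sketch over the same interval. The variable names are illustrative.

```python
import numpy as np

# arange: fixed step size, the stop value is excluded
by_step = np.arange(0, 1, 0.25)

# linspace: fixed number of points, the stop value is included by default
by_count = np.linspace(0, 1, 5)

print(by_step)    # 4 values, ends at 0.75
print(by_count)   # 5 values, ends at 1.0
```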

In [53]:
rdm1 = np.random.rand(3, 3)      # dimensions passed as separate integer arguments
rdm2 = np.random.random((3, 3))  # shape passed as a single tuple
print(rdm1)
print(rdm2)

[[0.21151833 0.7998574  0.54381542]
 [0.21905386 0.79047298 0.54367601]
 [0.33490538 0.85585442 0.53728884]]
[[0.9397773  0.99620548 0.3364907 ]
 [0.98795906 0.22682654 0.98793533]
 [0.95535158 0.10703763 0.13703677]]


In [54]:
# interval [0, 10)
np.random.randint(0, 10, (3,3))

array([[8, 3, 0],
       [1, 4, 7],
       [4, 9, 5]])

In [58]:
# mean 0, standard deviation 1 (standard normal distribution)
np.random.normal(0, 1, (3,3))

array([[-0.25937419, -1.14679968, -1.1781727 ],
       [-1.11460951, -1.93541343, -0.79657158],
       [-0.47742967,  0.58578598, -0.86458317]])
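
Random results differ on every run unless the generator is seeded. The sketch below uses `np.random.RandomState` with an arbitrary seed of 42 (an assumption for illustration) to show that two generators with the same seed produce the same numbers.

```python
import numpy as np

# Two independent generators with the same (arbitrary) seed
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)

sample_a = rng_a.randint(0, 10, (3, 3))   # integers in the interval [0, 10)
sample_b = rng_b.randint(0, 10, (3, 3))

# Same seed, same sequence of random numbers
print(np.array_equal(sample_a, sample_b))
```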

In [59]:
np.eye(3) #identity matrix

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

## Attributes

In [71]:
a1 = np.random.randint(5, size=10)
a2 = np.random.randint(10, size=(3, 4))
a3 = np.random.randint(10, size=(3, 4, 5))  # size=(number of matrices, rows, columns)
print("---a1---\n", a1, "\n")
print("---a2---\n", a2, "\n")
print("---a3---\n", a3)

---a1---
 [4 0 2 1 0 2 1 4 1 2] 

---a2---
 [[3 8 0 5]
 [3 6 1 3]
 [4 9 5 3]] 

---a3---
 [[[4 3 2 1 3]
  [2 2 1 4 8]
  [4 9 9 1 7]
  [9 3 4 8 9]]

 [[8 7 3 3 8]
  [6 8 2 4 5]
  [1 9 3 2 8]
  [5 3 8 2 0]]

 [[2 6 6 5 3]
  [9 9 3 1 3]
  [0 0 1 3 9]
  [0 5 4 0 6]]]


In [70]:
print("a3 ndim:", a3.ndim)
print("a3 shape:", a3.shape)
print("a3 size:", a3.size)
print("a3 type:", a3.dtype)

a3 ndim: 3
a3 shape: (3, 4, 5)
a3 size: 60
a3 type: int64
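
The attributes are related: `size` is the product of the entries of `shape`, and `ndim` is the length of `shape`. The sketch below checks this on a zero-filled array and also shows `itemsize` and `nbytes`, which are not covered in the text above.

```python
import numpy as np

a = np.zeros((3, 4, 5), dtype="int64")

print(a.ndim)      # number of dimensions: len(a.shape)
print(a.shape)     # extent of each dimension
print(a.size)      # total number of elements: 3 * 4 * 5
print(a.itemsize)  # bytes per element
print(a.nbytes)    # total bytes: size * itemsize
```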


## Indexing

In [72]:
print(a1)

[4 0 2 1 0 2 1 4 1 2]


In [73]:
print(a1[0])
print(a1[4])
print(a1[-1])

4
0
2


In [74]:
print(a2)

[[3 8 0 5]
 [3 6 1 3]
 [4 9 5 3]]


In [77]:
print(a2[0, 0])
print(a2[1, 1])
print(a2[0, -1])

3
6
5


## Slicing
Slicing gives access to subarrays using the slice notation with colons (:):

x[start:stop:step]

If any of these are not specified, they default to the values start=0, stop=size of dimension, step=1.
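
As a quick sketch of the notation, the colon syntax is shorthand for a built-in `slice` object, and omitted parts fall back to their defaults. The array `x` is illustrative.

```python
import numpy as np

x = np.arange(10)             # [0 1 2 3 4 5 6 7 8 9]

explicit = x[2:8:2]           # start=2, stop=8 (excluded), step=2
same = x[slice(2, 8, 2)]      # equivalent slice object
defaults = x[::]              # all defaults: the whole array
reversed_x = x[::-1]          # negative step walks backwards

print(explicit)               # [2 4 6]
print(reversed_x[:3])         # [9 8 7]
```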

In [78]:
a1 = np.random.randint(10, size=10)
print(a1)
print(a1[:5])   # first five elements
print(a1[3:])   # elements from index 3 onward
print(a1[3:8])  # subarray
print(a1[::2])  # every other element

[1 6 6 3 3 3 6 7 9 2]
[1 6 6 3 3]
[3 3 3 6 7 9 2]
[3 3 3 6 7]
[1 6 3 6 9]


In [79]:
print(a1[::-1])  # the defaults for start and stop are reversed for negative step values

[2 9 7 6 3 3 3 6 6 1]


In [87]:
print(a2, "\n")
print(a2[:2, :3]) # first two rows, first three columns
print(a2[2,3])

[[3 8 0 5]
 [3 6 1 3]
 [4 9 5 3]] 

[[3 8 0]
 [3 6 1]]
3


Multidimensional subarrays

In [88]:
print(a2[:, ::2])

[[3 0]
 [3 1]
 [4 5]]


Combining indexing and slicing to access single columns and rows:

In [89]:
print(a2[:, 0])  # all rows, column 0

[3 3 4]


In [90]:
print(a2[2, :])  # row 2 (the third row), all columns

[4 9 5 3]


In [91]:
print(a2[2])  # a single index applies to the first dimension (row 2)

[4 9 5 3]


**Array slices are views on the array data rather than copies!!!**

In [92]:
print(a2)

[[3 8 0 5]
 [3 6 1 3]
 [4 9 5 3]]


In [94]:
a2_sub = a2[:2, :2]
print(a2_sub)
print(a2)

[[3 8]
 [3 6]]
[[3 8 0 5]
 [3 6 1 3]
 [4 9 5 3]]


In [95]:
a2_sub[0, 0] = 9
print(a2_sub, "\n") # if you change a slice of array, original array changes too!
print(a2)

[[9 8]
 [3 6]] 

[[9 8 0 5]
 [3 6 1 3]
 [4 9 5 3]]


To copy the data within an array or subarray, use the copy() method.

In [96]:
a2_sub_copy = a2[:2, :2].copy()
print(a2_sub_copy)

[[9 8]
 [3 6]]


In [97]:
a2_sub_copy[0, 0] = 3
print(a2_sub_copy, "\n")
print(a2)

[[3 8]
 [3 6]] 

[[9 8 0 5]
 [3 6 1 3]
 [4 9 5 3]]
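
If it is unclear whether two arrays share their data, `np.shares_memory` can be used to check. This is a minimal sketch with an illustrative array `a`.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

view = a[:2, :2]           # slicing returns a view
copy = a[:2, :2].copy()    # copy() allocates new memory

view_shares = np.shares_memory(a, view)
copy_shares = np.shares_memory(a, copy)
print(view_shares, copy_shares)

# Writing through the view changes the original, writing to the copy does not.
view[0, 0] = 99
copy[0, 1] = -1
print(a[0, :2])
```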


Further basic manipulations for NumPy arrays are reshaping, joining, and splitting arrays.
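
A short sketch of these three manipulations, using `reshape`, `np.concatenate`, and `np.split`. The array names are illustrative.

```python
import numpy as np

# Reshaping: 6 elements arranged as 2 rows and 3 columns
grid = np.arange(1, 7).reshape(2, 3)

# Joining: stack two arrays along the first axis (rows)
stacked = np.concatenate([grid, grid], axis=0)

# Splitting: cut a 1-D array in front of index 2
left, right = np.split(np.arange(6), [2])

print(grid.shape, stacked.shape)
print(left, right)
```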

## Universal Functions

Computing the reciprocals of all elements in an array with a traditional Python loop.

In [104]:
np.random.seed(0)

def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

Determine how long the function takes for an array of 1,000,000 elements.

In [101]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

3.02 s ± 161 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Now with the NumPy universal function divide.

In [105]:
np.divide(1.0, values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

Determine how long the function takes for the big array using the universal function.

In [107]:
%timeit np.divide(1.0, big_array)

4.22 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**In conclusion, built-in Python loops that repeat small operations on individual items of a container are very slow!**

<font color='red'>Prefer the optimized universal functions for computations on arrays!</font> They perform a vectorized operation that is applied to each element of the array. The loop is not executed by the Python interpreter but inside the NumPy library, which is implemented in C.

The main purpose of universal functions is to quickly execute common operations on values in NumPy arrays. They are either executed on the whole array or a slice of the array (view). There are unary as well as binary functions that take one or two arrays as input.

NumPy makes use of overloading the native arithmetic operators, so the syntax is very natural to use.

In [108]:
1.0 / values

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [109]:
2.0 * values

array([12.,  2.,  8.,  8., 16.])

Arithmetic operators implemented in NumPy are addition (add), subtraction (subtract), negation (negative), multiplication (multiply), division (divide), floor division (floor_divide), exponentiation (power), and modulus (mod).
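
The following sketch checks that each operator gives the same result as the corresponding universal function. The dictionary `checks` is only a convenience for this example.

```python
import numpy as np

values = np.array([6, 1, 4, 4, 8])

# Each operator is compared with the universal function it maps to.
checks = {
    "+": np.array_equal(values + 2, np.add(values, 2)),
    "-": np.array_equal(values - 2, np.subtract(values, 2)),
    "unary -": np.array_equal(-values, np.negative(values)),
    "*": np.array_equal(values * 2, np.multiply(values, 2)),
    "/": np.array_equal(values / 2, np.divide(values, 2)),
    "//": np.array_equal(values // 2, np.floor_divide(values, 2)),
    "**": np.array_equal(values ** 2, np.power(values, 2)),
    "%": np.array_equal(values % 3, np.mod(values, 3)),
}

for operator, same in checks.items():
    print(operator, same)
```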

In [110]:
print(values)
values ** 2

[6 1 4 4 8]


array([36,  1, 16, 16, 64])

Further universal functions include absolute, trigonometric functions, exponentials, and logarithms. NumPy also provides aggregate functions (min, max, sum, prod, var, median, ...) and many more. If you can think of a numerical function, NumPy probably provides it.
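
As a sketch of the aggregate functions, the `axis` argument (not mentioned above) selects whether to aggregate over the whole array, down the columns, or along the rows. The array `m` is illustrative.

```python
import numpy as np

m = np.array([[3, 8, 0, 5],
              [3, 6, 1, 3],
              [4, 9, 5, 3]])

total = m.sum()             # over all elements
col_max = m.max(axis=0)     # one value per column
row_mean = m.mean(axis=1)   # one value per row

print(total)      # 50
print(col_max)    # [4 9 5 5]
print(row_mean)   # [4.   3.25 5.25]
```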

# 5. Pandas

Pandas is built on top of NumPy and provides a higher-level view of data. It provides Series and DataFrame (as well as Index) objects that allow you to attach row and column labels to multidimensional arrays, work with missing data, and perform powerful data operations familiar from databases and spreadsheets.

If you use the Anaconda Python stack, pandas is already installed.

In [111]:
import pandas as pd
pd.__version__

'0.24.2'

## Series

A Series object is a one-dimensional array of indexed data.

It provides both a sequence of values (a NumPy array) and a sequence of indices.

In [112]:
s = pd.Series([64.2, 274.3, 93.21, 52.87])
print(s)

0     64.20
1    274.30
2     93.21
3     52.87
dtype: float64


In [113]:
print(s.values)
print(type(s.values))

[ 64.2  274.3   93.21  52.87]
<class 'numpy.ndarray'>


In [114]:
print(s.index)
print(type(s.index))

RangeIndex(start=0, stop=4, step=1)
<class 'pandas.core.indexes.range.RangeIndex'>


In [115]:
s[2]

93.21

In [116]:
s[0:2]

0     64.2
1    274.3
dtype: float64

One way to think of a Series object is as a one-dimensional NumPy array with an explicitly defined index. This explicit index opens up further ways to access the data.

In [117]:
s = pd.Series([9.13, 5.89, 7.37, 1.93], index=['a', 'b', 'c', 'd'])
s

a    9.13
b    5.89
c    7.37
d    1.93
dtype: float64

In [118]:
print(s['a'])
print(s['c'])

9.13
7.37


In [119]:
s['b'] = 8.98
print(s)

a    9.13
b    8.98
c    7.37
d    1.93
dtype: float64


In [120]:
print(s['a':'c'])
print('\n', s[0:2])

a    9.13
b    8.98
c    7.37
dtype: float64

 a    9.13
b    8.98
dtype: float64


Another way to think of a Series object is as a specialization of a dictionary.

In [121]:
pop_dict = {'Berlin': 3613495, 'Munich': 1456039, 'Cologne': 1080394, 'Hamburg': 1834823, 'Frankfurt a.M.': 746878 }
pop = pd.Series(pop_dict)
pop

Berlin            3613495
Munich            1456039
Cologne           1080394
Hamburg           1834823
Frankfurt a.M.     746878
dtype: int64

In [129]:
pop["Hamburg"]

1834823

In [126]:
pop['Munich':'Hamburg']

Munich     1456039
Cologne    1080394
Hamburg    1834823
dtype: int64

## DataFrame

A DataFrame is a two-dimensional array with both flexible row indices and flexible column names. It can be thought of as a sequence of aligned (sharing the same index) Series objects.

In [132]:
area_dict = {'Be': 891.68, 'Munich': 310.70, 'Cologne': 405.02, 'Hamburg': 755.22, 'Frankfurt a.M.': 248.31 }
area = pd.Series(area_dict)
area

Be                891.68
Munich            310.70
Cologne           405.02
Hamburg           755.22
Frankfurt a.M.    248.31
dtype: float64

In [133]:
cities = pd.DataFrame({'population':pop, 'area in m²':area}) # rows are aligned on the index labels; a label missing from one Series gives NaN
cities

| | population | area in m² |
|---|---|---|
| Be | NaN | 891.68 |
| Berlin | 3613495.0 | NaN |
| Cologne | 1080394.0 | 405.02 |
| Frankfurt a.M. | 746878.0 | 248.31 |
| Hamburg | 1834823.0 | 755.22 |
| Munich | 1456039.0 | 310.7 |
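
The alignment can be reproduced in isolation with two tiny Series whose labels only partly overlap. The labels `x`, `y`, `z` and the column names `left` and `right` are made up for this sketch.

```python
import pandas as pd

left = pd.Series({"x": 1, "y": 2})
right = pd.Series({"y": 20.0, "z": 30.0})

# The row index of the DataFrame is the union of both indexes.
frame = pd.DataFrame({"left": left, "right": right})
print(frame)

# Labels missing from one Series are filled with NaN, which turns
# the integer column into a floating point column.
print(frame["left"].dtype)
```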


In [134]:
cities.index

Index(['Be', 'Berlin', 'Cologne', 'Frankfurt a.M.', 'Hamburg', 'Munich'], dtype='object')

In [135]:
cities.columns

Index(['population', 'area in m²'], dtype='object')

A DataFrame can also be regarded as a specialized dictionary that maps a key (column name) to a value (Series).

In [136]:
print(cities['area in m²'])
print('\n', type(cities['area in m²']))

Be                891.68
Berlin               NaN
Cologne           405.02
Frankfurt a.M.    248.31
Hamburg           755.22
Munich            310.70
Name: area in m², dtype: float64

 <class 'pandas.core.series.Series'>


**Careful: for a two-dimensional NumPy array called data, data[0] returns the first row. For a DataFrame object called cities, cities[col0], with col0 being a column name, returns that column (as a Series).**

In [137]:
a = np.random.randint(0, 10, (3,3))
print(a, '\n')
print(a[0])

[[9 3 5]
 [2 4 7]
 [6 8 8]] 

[9 3 5]
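
The contrast can be sketched with one small array wrapped in a DataFrame. The row labels `r0`, `r1` and column labels `c0`, `c1` are hypothetical.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2],
                [3, 4]])
df = pd.DataFrame(arr, index=["r0", "r1"], columns=["c0", "c1"])

first_row_arr = arr[0]     # NumPy: a single index selects a row
column_df = df["c0"]       # pandas: a single key selects a column
first_row_df = df.iloc[0]  # pandas: select a row by position with iloc

print(first_row_arr)
print(column_df.tolist())
print(first_row_df.tolist())
```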


In [138]:
vehicle_dict = {'Berlin':'B', 'Munich':'M', 'Frankfurt a.M.': 'F', 'Bremen':'HB'}
veh = pd.Series(vehicle_dict)
cities['vehicle number'] = veh
cities

| | population | area in m² | vehicle number |
|---|---|---|---|
| Be | NaN | 891.68 | NaN |
| Berlin | 3613495.0 | NaN | B |
| Cologne | 1080394.0 | 405.02 | NaN |
| Frankfurt a.M. | 746878.0 | 248.31 | F |
| Hamburg | 1834823.0 | 755.22 | NaN |
| Munich | 1456039.0 | 310.7 | M |


## Data Indexing and Selection

In [139]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
print(data)

1    a
3    b
5    c
dtype: object


In [143]:
print(data[1])

a


In [145]:
print(s, '\n')
print(s[1])

a    9.13
b    8.98
c    7.37
d    1.93
dtype: float64 

8.98


Using an integer that is not in the explicit index causes an error:

In [148]:
data[0]

KeyError: 0

Slicing with integers works, but it uses the implicit positional index rather than the explicit one, which may not be what you expect.

In [149]:
print(data[1:3])

3    b
5    c
dtype: object


In [150]:
cities

| | population | area in m² | vehicle number |
|---|---|---|---|
| Be | NaN | 891.68 | NaN |
| Berlin | 3613495.0 | NaN | B |
| Cologne | 1080394.0 | 405.02 | NaN |
| Frankfurt a.M. | 746878.0 | 248.31 | F |
| Hamburg | 1834823.0 | 755.22 | NaN |
| Munich | 1456039.0 | 310.7 | M |


In [151]:
cities['population']

Be                      NaN
Berlin            3613495.0
Cologne           1080394.0
Frankfurt a.M.     746878.0
Hamburg           1834823.0
Munich            1456039.0
Name: population, dtype: float64

In [152]:
cities_pop = cities.population

Insert a new column as the result of a computation with a universal function.

In [153]:
cities['density'] = cities['population'] / cities['area in m²']
cities

| | population | area in m² | vehicle number | density |
|---|---|---|---|---|
| Be | NaN | 891.68 | NaN | NaN |
| Berlin | 3613495.0 | NaN | B | NaN |
| Cologne | 1080394.0 | 405.02 | NaN | 2667.507777 |
| Frankfurt a.M. | 746878.0 | 248.31 | F | 3007.845032 |
| Hamburg | 1834823.0 | 755.22 | NaN | 2429.521199 |
| Munich | 1456039.0 | 310.7 | M | 4686.317992 |


In [154]:
cities.values

array([[nan, 891.68, nan, nan],
       [3613495.0, nan, 'B', nan],
       [1080394.0, 405.02, nan, 2667.507777393709],
       [746878.0, 248.31, 'F', 3007.8450324191535],
       [1834823.0, 755.22, nan, 2429.5211991207857],
       [1456039.0, 310.7, 'M', 4686.3179916317995]], dtype=object)

In [155]:
cities.T

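| | Be | Berlin | Cologne | Frankfurt a.M. | Hamburg | Munich |
|---|---|---|---|---|---|---|
| population | NaN | 3.6135e+06 | 1080390.0 | 746878 | 1834820.0 | 1.45604e+06 |
| area in m² | 891.68 | NaN | 405.02 | 248.31 | 755.22 | 310.7 |
| vehicle number | NaN | B | NaN | F | NaN | M |
| density | NaN | NaN | 2667.51 | 3007.85 | 2429.52 | 4686.32 |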


Accessing a specific row:

In [156]:
cities.values[1]

array([3613495.0, nan, 'B', nan], dtype=object)

Accessing a specific column:

In [157]:
cities['population']

Be                      NaN
Berlin            3613495.0
Cologne           1080394.0
Frankfurt a.M.     746878.0
Hamburg           1834823.0
Munich            1456039.0
Name: population, dtype: float64

### loc and iloc indexers for Series objects

The loc attribute allows indexing and slicing that always references the explicit index.

In [158]:
print(data)

1    a
3    b
5    c
dtype: object


In [159]:
data.loc[1]

'a'

In [160]:
data.loc[1:3]

1    a
3    b
dtype: object

The iloc attribute allows indexing and slicing that always references the implicit Python-style index.

In [161]:
data.iloc[1]

'b'

In [162]:
data.iloc[1:3]

3    b
5    c
dtype: object

The explicit indexers loc and iloc make the code clearer and are recommended, especially when the explicit index consists of integers.

### loc and iloc for DataFrame objects

In [163]:
cities

| | population | area in m² | vehicle number | density |
|---|---|---|---|---|
| Be | NaN | 891.68 | NaN | NaN |
| Berlin | 3613495.0 | NaN | B | NaN |
| Cologne | 1080394.0 | 405.02 | NaN | 2667.507777 |
| Frankfurt a.M. | 746878.0 | 248.31 | F | 3007.845032 |
| Hamburg | 1834823.0 | 755.22 | NaN | 2429.521199 |
| Munich | 1456039.0 | 310.7 | M | 4686.317992 |


The iloc indexer lets you index the underlying array as if it were a simple NumPy array.

In [164]:
cities.iloc[1:4, :2]

| | population | area in m² |
|---|---|---|
| Berlin | 3613495.0 | NaN |
| Cologne | 1080394.0 | 405.02 |
| Frankfurt a.M. | 746878.0 | 248.31 |


Using the explicit index of the DataFrame with loc.

In [165]:
cities.loc[:'Cologne', 'vehicle number':'density']

| | vehicle number | density |
|---|---|---|
| Be | NaN | NaN |
| Berlin | B | NaN |
| Cologne | NaN | 2667.507777 |


### What else is there?

Combine masking and fancy indexing:

In [167]:
cities.loc[cities.density > 3000, ['population', 'density']]

| | population | density |
|---|---|---|
| Frankfurt a.M. | 746878.0 | 3007.845032 |
| Munich | 1456039.0 | 4686.317992 |


In [168]:
cities.density > 100

Be                False
Berlin            False
Cologne            True
Frankfurt a.M.     True
Hamburg            True
Munich             True
Name: density, dtype: bool
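
Note that rows with a missing density compare as False. The sketch below shows this on a small Series with hypothetical labels `A`, `B`, `C` and uses `notna()` to tell missing values apart from values that merely fail the comparison.

```python
import numpy as np
import pandas as pd

density = pd.Series([np.nan, 2667.5, 3007.8], index=["A", "B", "C"])

mask = density > 3000      # a comparison with NaN yields False
known = density.notna()    # True where a value is present

# Boolean masks can be used directly for selection.
selected = density[mask]

print(mask.tolist())
print(known.tolist())
print(selected)
```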