# 01 Brief review of Python

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. Ben Stephenson. (2019). The Python Workbook : A Brief Introduction with Exercises and Solutions. 
2. Vanderplas, J.T. (2016). Python data science handbook: Essential tools for working with data.

The following Python modules are used in this section:
- `numpy` (a third-party package; make sure that you have it installed)
- `re` (part of the Python standard library, no installation is needed)

## Lesson 1

### Hello world

The standard first step in learning any language is the program "Hello world".

It is pretty short in Python:

In [None]:
print("Hello, world!")

By default, each `print` call ends the line:

In [None]:
print("Hello,")
print("world!")

To change this, pass the `end` argument:

In [None]:
print("Hello,", end="_")
print("world!")

One can print several elements

In [None]:
print("Hello", ", ", "world", "!")

By default, elements are separated by a space `" "`. To change it, pass the `sep` argument:

In [None]:
print("Hello", ", ", "world", "!", sep="")

### Comments

The symbol `#` starts a comment. Everything to the right of it on the same line is ignored by Python.

The following code does nothing

In [None]:
# print("Hello, world!")

### Variables

A variable is a named location in computer memory that holds a value.

Variable names in Python consist of letters, digits and underscores. A name cannot begin with a digit. Names are case sensitive.

Correct variable names:

```python
step_size
StepSize
stepSize
size21
step1_size5
```

Incorrect names:

```python
step size # no spaces inside are allowed
1_step # no numbers in the beginning
featur%one # symbol % is not allowed
```

Variables are created using an assignment statement: the name appears to the left of `=`, and the value that will be stored in the variable appears to the right of it.

To see what is stored in a variable we use `print()`.

In [None]:
s = "one, two, three"
a = 4
print(s, a)

A variable can hold the special value `None`. It means "the variable exists, but it holds no value yet".

In [None]:
x = None
print(x)

### Expressions and math operators

The right-hand side of an assignment statement can be an expression. It can contain numbers and the standard math operators (`+`), (`-`), (`/`), (`*`).

It can also include variables.

Less obvious operators: exponentiation (`**`), floor division (`//`), modulo (`%`).

In [None]:
x = 1
y = x + 1
z = y ** 3
print("x=", x, "y=", y, "z=", z)

Let us figure out how the operators (`//`) and (`%`) work. First, ordinary division:

In [None]:
x = 14
y = 5
d = x / y
print("d=", d)

Operator (`//`) computes the floor of the quotient that results when one number is divided by another. 

In [None]:
f = x // y
print("f=", f)

Operator (`%`) computes the remainder when one number is divided by another.

In [None]:
r = x % y
print("r=", r)

Check that `x` is recovered from the quotient and the remainder:

In [None]:
x1 = y * f + r
print("x=", x, "x1=", x1)
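
The same identity also holds for negative operands, because `//` always rounds toward minus infinity. A minimal sketch (the values are chosen only for illustration); the built-in function `divmod()` returns the quotient and the remainder together:

```python
x = -14
y = 5
f = x // y  # the floor of -2.8 is -3, not -2
r = x % y   # the remainder takes the sign of the divisor
print("f=", f, "r=", r)
print("check:", y * f + r)

q, m = divmod(14, 5)  # same as (14 // 5, 14 % 5)
print("q=", q, "m=", m)
```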

The same variable can appear on both sides.

In [None]:
x = 15
x = x + 2
print(x)

There is a convenient compact form for this: `x += 2`. It works with all math operators.

In [None]:
x = 15
print(x)
x += 2
print(x)
x *= 3
print(x)

### Calling functions 

Parts of code can be wrapped into functions.

- to avoid retyping
- to increase clarity

An example: let us round a real number. We can use floor division (`//`) for it.

Let us see what happens when we divide by 1:

In [None]:
print(15.2 // 1)
print(15.4 // 1)
print(15.6 // 1)
print(15.8 // 1)

But we want `15.6 -> 16`. Just add `0.5` before division.

In [None]:
x = 15.4
rx = (x + 0.5) // 1
print(rx)

In [None]:
x = 15.6
rx = (x + 0.5) // 1
print(rx)

It works well. But we don't want to retype it again and again. A function will help.

We could write a function for our rounding procedure, but Python already has one: `round()`

In [None]:
x = 15.4
print(round(x))

In [None]:
x = 15.6
print(round(x))

By the way: `print()` is also a function.

Notice: when we call a function we write its name followed by its arguments in parentheses.

Also notice the slightly different results: our rounding expression returns numbers with a decimal point, `15.0`, `16.0`, while the function `round` gives `15`, `16`. This is because `round`, when called with a single argument, returns an integer rather than a float.
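
One more difference shows up for values lying exactly halfway between two integers: the built-in `round()` rounds such values to the nearest even integer, while the add-0.5 expression always rounds them up. A short illustration (the variable names are chosen only for this sketch):

```python
print(round(0.5), (0.5 + 0.5) // 1)
print(round(1.5), (1.5 + 0.5) // 1)

half_even = round(2.5)       # built-in: ties go to the even neighbour
half_up = (2.5 + 0.5) // 1   # our expression: ties go up
print(half_even, half_up)

# round() also accepts the number of decimal digits to keep
two_digits = round(3.14159, 2)
print(two_digits)
```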

### Types

The content of a variable always has a type. Simple standard types:

- `int` - integer number, no decimal point (`12`, `-2312`)
- `float` - real number, decimal point or `e`-symbol is used (`.321`, `12.0`, `34.591`, `1.28e-11`, `1e13`)
- `complex` - complex number, includes imaginary unit (`3j`, `12.3+3.2j`) 
- `str` - text string, enclosed in quotation marks or apostrophes (`"one and two"`, `'Data'`)
- `bool` - boolean value, can be either `True` or `False`

Why are there many types? Because different types are processed in different ways.

Types can be converted. Just use the type name as a function name:

In [None]:
x = 15.6
rx = (x + 0.5) // 1
irx = int(rx)  # converts here float variable rx to int type
print("rx=", rx, "irx=", irx)

In [None]:
x = 953.3
s = "x=" + str(x)  # here x is converted to string and concatenated with 'x='
print(s)
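
A few more conversions, as a sketch. Note that `int()` does not round: it simply drops the fractional part.

```python
a = int(15.9)      # the fractional part is dropped
b = int(-15.9)     # truncation is toward zero
c = float(7)       # int to float
d = float("1e3")   # a string in e-notation to float
print(a, b, c, d)
print(bool(0), bool(3))  # zero is False, any other number is True
```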

Types can be checked using function `type()`

In [None]:
i = 2
print(type(i))

In [None]:
x = 2.3
print(type(x))

In [None]:
c = 3.4 + 5.6j
print(type(c))

In [None]:
s = "expansion"
print(type(s))

### Working with strings

Operations on strings:

- computing length of a string
- concatenating two strings
- taking substrings and individual symbols

Length is computed with function `len()`

In [None]:
s1 = "Data"
string_length = len(s1)
print(string_length)

Concatenation of strings is done with (`+`)

In [None]:
s1 = "this is a string"
s2 = s1 + " and " + s1
print(s2)

Taking symbols and substrings: the key idea is that symbols are numbered starting from zero:
```
   01234567
s="compound"
```
Elements are accessed by index:
```
s[0] -> c
s[1] -> o
s[2] -> m
s[3] -> p
...
```
Substrings are accessed via ranges of indexes (the upper bound is not included):
```
s[0:3] -> 'com'
s[:3]  -> 'com'
s[3:8] -> 'pound'
s[-3:] -> 'und'
```

In [None]:
# Symbol positions start from 0
#     012345678901234
s1 = "this is string!"
print(s1)
w1 = s1[0:4]
w2 = s1[5:7]
w3 = s1[-7:-1]
s2 = w2 + " " + w1 + " " + w3 + "?"
print(s2)
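
Negative indexes count from the end of the string, and a third number in the range sets the step. A short sketch with the string from the diagram above:

```python
s = "compound"
print(s[-1])    # the last symbol
print(s[-2])    # the symbol before the last one
print(s[::2])   # every second symbol
print(s[::-1])  # a negative step gives the reversed string
```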

A number can appear inside a string:
```python
s = "value of N is 999"
```
Or it can be written as a string:
```python
a = "1.234"
```
To use it in computations one has to convert it to a numerical type, `int` or `float`.

In [None]:
s = "value of N is 999"
N = s[-3:]
print(N)
print(type(N))  # N is a string, not a number

In [None]:
N1 = N + 1  # Error, types are incompatible

In [None]:
N = int(s[-3:])  # Convert string '999' to number 999
print(N)
print(type(N))

In [None]:
N1 = N + 1  # Now N can appear in computations
print(N1)

In [None]:
a = "1.234"
b = a + 1.766  # Error because types are incompatible

In [None]:
b = float(a) + 1.766  # Now it works
print(b)

### String formatting

Concatenation is often not enough to present information appropriately. More complicated cases are handled via string formatting: curly brackets and `.format()`.

In [None]:
x = 10.0 / 7.0
y = 13.0 / 7.0
s = 'Results: x={} and y={}'.format(x, y)
print(s)

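Printed values can be rounded (the actual variable values are unchanged):
```python
'{:width.precision}'.format(x)
```
Here `width` is the minimal total width and `precision` is the precision. For a float without an explicit type code, as here, `precision` is the number of significant digits.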

In [None]:
x = 10.0 / 7.0
y = 13.0 / 7.0
s = 'Results: x={:4.3} and y={:5.4}'.format(x, y)
print(s)

f-strings are a more convenient way of formatting. One puts the prefix `f` before the string and writes variables inside curly brackets.

In [None]:
x = 10.0 / 7.0
y = 13.0 / 7.0
s = f'Results: x={x:4.3} and y={y:5.4}'
print(s)
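
A type code after the precision changes its meaning. The sketch below compares a few common variants for the same value; the variable names `a`, `b`, `c`, `d` are used only for illustration:

```python
x = 10.0 / 7.0
a = f"{x:.3f}"   # f: exactly 3 digits after the decimal point
b = f"{x:8.3f}"  # the same, padded with spaces to width 8
c = f"{x:.2e}"   # e: scientific notation
d = f"{x:4.3}"   # no type code: 3 significant digits
print(a, b, c, d, sep="|")
```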

### Reading input

A program can read input from the keyboard by calling the `input()` function. The data is always obtained as a string. The programmer must convert it to the proper type.

If called with a string argument, like `input('Enter a value')`, the string is printed as a prompt.

In [None]:
s = input("Input i")
print(type(s))

In [None]:
i = int(s)
print(i + 15)

Keyboard input is typically not used in data science. Data are usually read from files.

### Modules

Some functions are available in Python anytime and anywhere: `print()`, `input()`, `round()`

Less common functions are collected in modules. Before use, a module has to be imported into the program with the `import` statement.

Math functions live in the module `math`. It is imported with the statement
```python
import math
```
Using module content: add the prefix `math.` before the function name:
```python
a = math.sin(0.1)
```

In [None]:
import math

x = (math.sqrt(5) + 1) / 2
print(x)

If a module name is too long one can rename it when importing

In [None]:
import datetime as dt

today = dt.date.today()
print("Today's date:", today)

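Or one can merge the module content into the global namespace, so that no reference to the module name is required. This form is convenient for quick experiments, but it can silently overwrite names that already exist, so it is usually avoided in larger programs.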

In [None]:
from random import *

print(randint(1, 10))

Sometimes only one function is needed. Then we import exactly what we want

In [None]:
from math import log10

print(log10(10**5))

### Module NumPy

The standard module `math` is rarely used in data science: its functions work on individual numbers rather than on whole arrays, which is not enough for serious numerical computations.

The de facto standard for serious work is the `numpy` module. It provides counterparts of most functions from `math` and much more.

In [None]:
import numpy as np  # this is standard form of import

x = np.sin(np.pi / 2)**2 + np.cos(np.pi / 2)**2
print(x)

Notice that `math` is a part of Python and is always available. Module `numpy` must be installed separately. 

## Lesson 2

### Conditional constructs

Conditional constructs allow branching: statements may or may not be executed depending on some conditions.

Forms of conditional constructs:
- if
- if-else
- if-elif-else
- if-elif

Remark for those who are familiar with other languages: Python has no switch-case statement. It is usually simulated via if-elif-else constructs. (Since version 3.10 Python also offers a `match` statement for structural pattern matching.)

### If statements

The simplest form consists of `if` statement only:
```python
if <condition_is_true>:
    <do_something>
<continue_working>
```
Observe the colon `:` ending the line of the `if` statement. It is required by the language rules.

Statements `<do_something>` are executed only if the condition evaluates to `True`. Otherwise this construct does nothing. Statements `<continue_working>` are executed in any case.

The body of an `if` statement consists of one or more statements that must be **indented more than the `if` keyword**. 

The body ends before the next line that is indented **the same amount as (or less than) the `if` keyword**.

Programmers can choose how many spaces to use when indenting the bodies of `if` and other similar statements. The recommended value is 4 spaces.

Often the condition is written as a relation:
```python
x < y
x > y
x <= y
x >= y
x == y
x != y
x is None
x is not None
```

In [None]:
N = int(input("Enter N="))
if N > 0:
    print("You have entered a positive value")
if N <= 0:
    print("You have entered a negative value or zero")

### If-else statements

In the previous example the second test is redundant. If the condition `N > 0` is false, `N <= 0` is true automatically. The two calls of `print()` are mutually exclusive.

Instead of the second `if` it is better to use an if-else statement:
```python
if <condition_is_true>:
    <do_something>
else:
    <do_other_things>
<continue_working>
```
Observe colons `:` ending both `if` and `else` lines. They must be there.

Observe the same indentation of the `if` and `else` bodies.

Either `<do_something>` or `<do_other_things>` will be executed. Never both of them!

To fix the previous example we replace the second `if` with an `else` statement.

In [None]:
N = int(input("Enter N="))
if N > 0:
    print("You have entered a positive value")
else:
    print("You have entered a negative value or zero")

### If-elif-else statements

If there are more than two options:
```python
if <condition_1_is_true>:
    <do_something_1>
elif <condition_2_is_true>:
    <do_something_2>
elif <condition_3_is_true>:
    <do_something_3>
else:
    <do_other_things>
<continue_working>
```

Exactly one branch will be executed: either `<do_something_1>` or `<do_something_2>` or `<do_something_3>` or `<do_other_things>`.

The `else` branch can be omitted. In this case, if none of the conditions is true, the if-elif block does nothing.

In [None]:
N = int(input("Enter N="))
if N > 0:
    print("You have entered a positive value")
elif N == 0:
    print("You have entered zero")
else:
    print("You have entered a negative value")

### Nested if


The body of an if-elif-else block can contain almost any Python statements, including other `if` blocks.

In [None]:
x = float(input("Enter x="))

if x < 0.0:
    descr = "negative"
elif x == 0.0:
    descr = "zero"
else:
    descr = "positive"
    if x < 1.0:
        descr = "small " + descr
    if x >= 10:
        descr = "big " + descr
print(f"You have entered a {descr} value")

### Conditional expression (ternary operator)

A decision can be made within a single expression. It is called a conditional expression.
```python
a = <one_value> if <condition_is_true> else <another_value>
```
Conditional expressions make code more compact and, for simple choices, more readable.

In [None]:
x = float(input("Input a number "))
a = "small" if abs(x) < 10 else "big"
print(f"You have entered a {a} number")

### Boolean logic

Results of relations can be negated and combined using the logical operators `not`, `and`, `or`.

The operator `not` inverts a logical value. The truth table:

| x | not x |
| :-: | :-: |
| False | True |
| True | False|

Operator `and` compares two boolean values. The result is true only when both arguments are true. The truth table:

| x | y | x and y |
| :-: | :-: | :-: |
| False | False | False |
| False | True | False |
| True | False | False |
| True | True | True |

Operator `or` also compares two boolean values. The result is false only when both arguments are false. The truth table:

| x | y | x or y |
| :-: | :-: | :-: |
| False | False | False |
| False | True | True |
| True | False | True |
| True | True | True |


In [None]:
N = int(input('Enter N='))

if N == 2 or N == 4 or N == 6 or N == 8 or N == 10:
    print("You have entered an even number")
elif N >= 1 and N <= 9:
    print("You have entered an odd number")
else:
    print("Have no idea what you have entered")
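
The same classification can be written more compactly. The sketch below uses the remainder operator to test for evenness and a chained comparison, `1 <= N <= 10`, which Python treats as `1 <= N and N <= 10`. The value of `N` is fixed here instead of being read from the keyboard, and the names `in_range`, `is_even` and `kind` are chosen only for this illustration.

```python
N = 7

in_range = 1 <= N <= 10  # chained comparison
is_even = N % 2 == 0     # an even number has zero remainder

if in_range and is_even:
    kind = "even"
elif in_range:
    kind = "odd"
else:
    kind = "unknown"
print(f"N={N} is {kind}")
```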

### While loops

A `while` loop repeatedly executes its body statements as long as a condition evaluates to `True`.
```python
while <condition_is_true>:
    <do_something_again_and_again>
```

The body of the loop must be **indented more than the `while` keyword**. 

The body ends before the next line that is indented **the same amount as (or less than) the `while` keyword**.


In [None]:
# The algorithm assumes N >= 0 and K > 0 (otherwise the loop may never end)
N = int(input("Enter dividend N="))
K = int(input("Enter divisor  K="))

acum = 0
quotient = 0
remainder = 0
print()
while acum < N:
    acum += K
    quotient += 1
    print(f"watching: acum={acum:4}, quotient={quotient:4}, (acum<N)={acum<N}")

print()
if acum > N:
    quotient -= 1
    
remainder = N - quotient * K
    
print(f"Result: {N} = {K} * {quotient} + {remainder}")    

### For loops

A `for` loop also repeats the execution of its body, once for each element of a collection:
```python
for <variable> in <collection>:
    <do_something_for_each_element_in_collection>
```

Observe the indentation. Rules are the same as for `if` and `while`.

The collection can be a range of integers, the letters of a string, and some others.

Each element of the collection is assigned to `<variable>` before the corresponding execution of the body.

A collection of integers can be constructed by calling the function `range()`.

In [None]:
for i in range(5):  # i = [0, 1, 2, 3, 4]
    print(f"i={i}, i squared={i**2}")

In [None]:
for i in range(5, 10):  # i = [5, 6, 7, 8, 9]
    print(f"i={i}, i squared={i**2}")

In [None]:
for i in range(1, 11, 2):  # i = [1, 3, 5, 7, 9]
    print(f"i={i}, i squared={i**2}")

In [None]:
for i in range(10, 0, -2):  # i = [10, 8, 6, 4, 2]
    print(f"i={i}, i squared={i**2}")

Observe that a range starts from zero by default and that the stop value itself is not included.
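
To look at the numbers produced by `range()` one can convert it to a list (lists are discussed in Lesson 3). A small sketch repeating the four ranges used above:

```python
r1 = list(range(5))
r2 = list(range(5, 10))
r3 = list(range(1, 11, 2))
r4 = list(range(10, 0, -2))
print(r1, r2, r3, r4, sep="\n")

# When start equals stop there is nothing to iterate over
print(list(range(5, 5)))
```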

### Nested loops

The body of a loop can contain `if` statements as well as other loops.

In [None]:
s = input("Enter a word (blank to quit)")

while s != "":
    for i in range(3):
        print(f"{s} ", end="")  # do not go to new line
    print()  # now go to new line when for-loop is finished
    
    s = input("Enter a word (blank to quit)")

### Break and continue

The previous code was not so good: we were forced to write two identical input statements. This is because the condition is checked before the execution of the body.

Some other programming languages have do-while loops where the condition is checked after the execution. But Python does not have them.

But we can simulate a do-while loop using an `if` statement and `break`.

The statement `break` immediately terminates the loop. It works with both `for` and `while` loops.

In [None]:
acum = 0
while True:
    N = int(input("Enter N="))
    acum += N
    print(f"acum={acum}")
    if acum > 10:
        break

The statement `continue` skips the rest of the current iteration and jumps to the next one. It works with both `for` and `while` loops.

In [None]:
acum = 0
while True:
    N = int(input("Enter N="))
    if N < 0:
        print("Ignore negative values")
        continue
    acum += N
    print(f"acum={acum}")
    if acum > 10:
        break

Often we need to leave a loop from the middle of its body. Let us improve the example from the previous section: no duplicated input statements anymore.

In [None]:
while True:
    s = input("Enter a word (blank to quit)")
    if s == "":
        break
    
    N = 3
    print(f"Now I repeat it {N} times: ", end="")
    for i in range(N):
        print(f"{s} ", end="")  # do not go to new line
    print()  # now go to new line when for-loop is finished

## Lesson 3

### User defined functions

Let us recall why we need functions:

- to avoid retyping
- to increase clarity

Programmers can define their own functions. Splitting code into functions is good programming style.

```python
def my_function(x, y):
    <do_something>
    return z
```

The keyword `def` starts a function definition. It is followed by the function name. The rules for function names are the same as for variables. Then the function parameters are listed in parentheses. The line ends with a colon.

A function can have no parameters. Empty parentheses are required anyway.

The function body must be indented.

The result, say `z`, is returned via the `return z` statement. A function can have multiple `return` statements, but only one of them is executed in each call.

A function can return nothing. In this case one can write a `return` statement without a value, or even omit `return`: the function ends automatically when all body statements have been executed. In both cases the call evaluates to `None`.

Variables defined in a function body are local and do not exist outside the function.

In [None]:
def my_round(x):
    # a function that rounds floats 
    return int((x + 0.5) // 1)

x = 12.34
i = my_round(x)
print(f"x={x}, my_round={i}")
x = 12.54
i = my_round(x)
print(f"x={x}, my_round={i}")

In [None]:
import datetime as dt

def currently():
    # a function without args
    today = dt.date.today()
    print("Today's date:", today)
    
currently()

In [None]:
def largest(x, y):
    # a function with multiple returns
    if x >= y:
        return x
    else:
        return y
    
x = 3.4
y = 5.6
z = largest(x, y)
print(f"x={x}, y={y}, largest={z}")

Observe the descriptive comments placed right after each function header.
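
Python also has a dedicated convention for such descriptions: a string literal placed as the first statement of the function body, called a docstring. Unlike a comment, it is stored in the function and is shown by `help()`. A sketch based on `my_round` from above:

```python
def my_round(x):
    """Round a float to the nearest integer; halves are rounded up."""
    return int((x + 0.5) // 1)

print(my_round.__doc__)  # the docstring is available at run time
print(my_round(12.34), my_round(12.5))
```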

A variable defined outside of any function is known as a global variable. A global variable can be read both inside and outside of functions.

In [None]:
N = 10  # it will be a global variable
W = 6  # this is also a global variable

def acum_squares():
    print("Function acum_squares starts")
    acum = 0
    for i in range(1, N + 1):  # we can freely read a value of the global variable
        isq = i**2
        acum += isq
        print(f"i={i:{W}}, isq={isq:{W}}, acum={acum:{W}}")  # observe curly brackets around W
    return acum

a = acum_squares()
print(f"a={a}")

In the above example using the global `W` makes sense. But the global `N` is bad practice: it should be a function parameter instead. A better implementation:

In [None]:
W = 6

def acum_squares(N):
    print(f"Function acum_squares starts with N={N}")
    acum = 0
    for i in range(1, N + 1):  # N is now a function parameter
        isq = i**2
        acum += isq
        print(f"i={i:{W}}, isq={isq:{W}}, acum={acum:{W}}")  # observe curly brackets around W
    return acum

a = acum_squares(3)
print(f"a={a}")
a = acum_squares(4)
print(f"a={a}")

Sometimes we need to declare a global variable explicitly.

By default, a variable assigned inside a function is created as local.

In [None]:
N = 1  # define global variable

def local_scope():
    N = 2  # N here is local and does not interfere with global N
    print(f"locally N={N}")
    return N

print(f"globally N={N}")
local_scope()
print(f"globally N={N}")

What if we want to modify a global variable inside a function? 

Simple assignment does not work because it creates a new local variable instead.

The solution is to declare the variable as global:

In [None]:
N = 1

def incr():
    global N
    N = 2

print(f"before incr N={N}")
incr()
print(f"after incr N={N}")

Changing global variables inside functions is usually a bad practice. A program becomes more complicated. It becomes difficult to understand and control how it works.
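
A common alternative, sketched below, is to pass the value in as a parameter and return the new value, so that the caller decides where to store it. The function name `incremented` is made up for this illustration.

```python
def incremented(n):
    # No global state is touched: the result depends only on the argument
    return n + 1

N = 1
print(f"before N={N}")
N = incremented(N)  # the update is visible right here, at the call site
print(f"after N={N}")
```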

Unlike in many other programming languages, arguments can be passed to functions by parameter name.

Consider an example

In [None]:
def cylinder_volume(rad, height):
    # Compute the volume of a cylinder
    pi = 3.14159265359
    return pi * rad**2 * height

x = 4
y = 3
V = cylinder_volume(x, y)
print(f"rad={x}, height={y}, vol={V}")

Here the cylinder radius is `x` and the height is `y`. We must remember their order: radius goes first. If we forget and switch them the result is incorrect. 

This can be avoided by passing arguments by name. Doing so, we can even change their order.

In [None]:
V1 = cylinder_volume(height=y, rad=x)
print(f"rad={x}, height={y}, vol={V1}")

Function parameters can have default values. These values are used in the function body when it is called without specifying these parameters.

Example: function that computes n-th term of arithmetic progression

In [None]:
def arithm_progr(a1, n, d=1):
    # n-th term of arithmetic progression
    return a1 + (n - 1) * d

n = 10
a1 = 5
d = 10
an = arithm_progr(a1, n, d)
print(f"n={n}, a1={a1}, an={an}, d={d}")

There is a default value for `d`. If `d` is omitted in a function call, the default value is used:

In [None]:
n = 13
a1 = 1

d = 1
an = arithm_progr(a1, n, d)
print(f"n={n}, a1={a1}, an={an}, d={d}")

an = arithm_progr(a1, n)
print(f"n={n}, a1={a1}, an={an}")
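
Positional and named arguments can be mixed in one call, provided that the positional ones come first. A short sketch with the same function, repeated here to keep the example self-contained:

```python
def arithm_progr(a1, n, d=1):
    # n-th term of arithmetic progression
    return a1 + (n - 1) * d

t1 = arithm_progr(5, 10)        # d takes its default value 1
t2 = arithm_progr(5, 10, d=10)  # d is passed by name
t3 = arithm_progr(n=10, a1=5)   # all arguments by name, in any order
print(t1, t2, t3)
```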

### Classes

Classes provide a means of bundling data and functionality together. Creating a new class creates a new type of object, allowing new instances of that type to be made. Each class instance can have attributes attached to it for maintaining its state. Class instances can also have methods (defined by its class) for modifying its state.

Keyword `class` starts class definition.

In [None]:
class Employee:
    count = 0  # class attribute

    def __init__(self, name, salary):
        # Constructor
        print("Constructor is called")
        self.name = name  # instance attribute
        self.salary = salary  # instance attribute
        self.id = Employee.count # instance attribute
        Employee.count += 1
        
    def __del__(self):
        # Destructor
        print("Destructor is called")
        Employee.count -= 1
   
    def __str__(self):
        print("Making string representation of the object")
        return f"Name: {self.name}, Salary: {self.salary}, Id: {self.id}"
    
    def promotion(self, percent):
        # User-defined method
        self.salary *= 1 + percent / 100

There are special methods. Their names are predefined in Python and begin and end with two underscores. In the class above these are `__init__`, `__del__`, and `__str__`.

Programmers define their own methods similarly to ordinary functions. The parameter `self` is a reference to the object itself.

Once a class has been defined one can create class instances: `<class_name>(<parameters>)`. The instance is assigned to a variable.

In [None]:
# Recruiting two persons
emp1 = Employee("Peter", 1000)
emp2 = Employee("Razvan", 3000)
print(emp1)
print(emp2)
print(f"Total count: {Employee.count}")

Accessing methods and attributes: `<object_variable>.<method_or_attribute_name>`

In [None]:
# Increase the salary of the first person
print(emp1.salary)
emp1.promotion(10)
print(emp1.salary)
print(emp1)

In [None]:
# Dismiss the second one
del emp2
print(emp2)  # Error, the name emp2 no longer exists

In [None]:
# Now there is only one person
print(f"Total count: {Employee.count}")
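
Strictly speaking, `del` removes a name, not an object. In CPython, the standard interpreter, `__del__` runs once the last reference to the object is gone, so the counter changes only then. A minimal sketch with a hypothetical class `Tracked` (it is not part of the example above); other interpreters may call `__del__` later:

```python
class Tracked:
    alive = 0  # class attribute counting living instances

    def __init__(self):
        Tracked.alive += 1

    def __del__(self):
        Tracked.alive -= 1

a = Tracked()
b = a  # a second name for the same object
del a
after_first_del = Tracked.alive   # the object is still referenced by b
del b
after_second_del = Tracked.alive  # no references are left
print(after_first_del, after_second_del)
```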

In [None]:
# Employ two new persons
emp2 = Employee("Anna", 3000)
emp3 = Employee("Helga", 4500)
print(f"Total count: {Employee.count}")

In [None]:
# Compute total payment to the employees 
acum = emp1.salary + emp2.salary + emp3.salary
print(f"Total: {acum}")

In [None]:
# Greetings to everyone
s = "Hi " + emp1.name + ", " + emp2.name + " and " + emp3.name + "!\n"
s += "Happy New Year!"
print(s)

In [None]:
# Clean all
del emp1
del emp2
del emp3
Employee.count = 0

### Inheritance

A new class can be created on the basis of another class. This is called inheritance.

Here is a copy of the class `Employee`, just for convenience.

In [None]:
class Employee:
    count = 0  # class attribute

    def __init__(self, name, salary):
        # Constructor
        self.name = name  # instance attribute
        self.salary = salary  # instance attribute
        self.id = Employee.count # instance attribute
        Employee.count += 1
        
    def __del__(self):
        # Destructor
        Employee.count -= 1
   
    def __str__(self):
        return f"Name: {self.name}, Salary: {self.salary}, Id: {self.id}"
    
    def promotion(self, percent):
        # User-defined method
        self.salary *= 1 + percent / 100

We are going to define a new class `Intern` based on `Employee`. `Employee` is called the parent class or superclass.

When we inherit from the parent class we have access to its attributes and methods; we can override them and add more attributes and methods.

In [None]:
class Intern(Employee):
    
    def __init__(self, name, salary, trial_period):
        super().__init__(name, salary)
        self.trial_period = trial_period
    
    def trial_period_salary(self):
        return 0.5 * self.salary 
    
    def __str__(self):
        s = super().__str__()
        a = self.trial_period_salary()
        return f"Intern {s}, TrialPeriod: {self.trial_period}, TrialPeriodSalary: {a}"

In [None]:
# One regular employee and one intern
emp1 = Employee("Elsa", 8000)
inn1 = Intern("John", 5000, 3)
print(emp1)
print(inn1)

In [None]:
del emp1
del inn1
Employee.count = 0
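
An instance of a child class is also an instance of its parent class. This can be checked with the built-in functions `isinstance()` and `issubclass()`. The sketch uses stripped-down versions of the two classes to stay self-contained:

```python
class Employee:
    def __init__(self, name, salary):
        self.name = name
        self.salary = salary

class Intern(Employee):
    def __init__(self, name, salary, trial_period):
        super().__init__(name, salary)
        self.trial_period = trial_period

emp = Employee("Elsa", 8000)
inn = Intern("John", 5000, 3)
print(isinstance(inn, Intern))
print(isinstance(inn, Employee))     # an intern is also an employee
print(isinstance(emp, Intern))       # but not the other way round
print(issubclass(Intern, Employee))
```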

### Lists

All variables above held one value: integer, float, string and so on. But to process a large amount of data one needs a named container that holds several values.

A list is a sequence of values separated by commas and enclosed in square brackets. A list can be assigned to a variable just like a simple value.

In [None]:
v = [1.2, 3.4, 7.8]
print(v)

A list can be empty. An empty list is needed when we are going to put values into it as the program executes.

In [None]:
v = []
print(v)

Size of a list can be computed using function `len`.

In [None]:
v = [4.4, 5.5, 6.6, 7.7]
print(len(v))
v = []
print(len(v))

### Getting and updating list elements

Elements of a list are numbered sequentially with integers, starting from 0. Each integer identifies a specific element in the list, and is referred to as the index for that element.

Elements are accessed by their indexes: One uses variable name followed by square brackets with the index inside. Elements can be read or updated in this way.

In [None]:
# Reading list elements
v = [1.2, 3.4, 7.8]
print(v[0])
print(v[1])
print(v[2])

In [None]:
# Updating elements
v = [1.2, 3.4, 7.8]
print(v)
v[0] = -12.3
print(v)

Sublists can be taken by index ranges: one uses the lower and upper values of the index separated by a colon. The upper value is not included.

In [None]:
v = [0, 10, 20, 30, 40, 50, 60, 70]
print(v[0:3])
print(v[:3])
print(v[2:6])
print(v[-3:])
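
A slice is a new list, while plain assignment only gives a second name to the same list. The sketch below shows the difference; the names `alias` and `snapshot` are chosen for illustration.

```python
v = [0, 10, 20, 30]
alias = v        # another name for the same list
snapshot = v[:]  # a full slice creates a new list

v[0] = -1
print(alias)     # sees the change
print(snapshot)  # keeps the old value
```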

### Heterogeneous lists

Lists admit elements of arbitrary types. Types can be mixed within one list. Even a list can be an element of a list.

In [None]:
v = [12, 0.34, "mixture", True, [1,2,3]]
print(f"Value: {v[0]:9}, Type: {type(v[0])}")
print(f"Value: {v[1]:9}, Type: {type(v[1])}")
print(f"Value: {v[2]:9}, Type: {type(v[2])}")
print(f"Value: {v[3]:9}, Type: {type(v[3])}")
print(f"Value: {v[4]}, Type: {type(v[4])}")

A list holding data of various types can be updated with arbitrary types.

In [None]:
v = [12, 0.34, "mixture", True]
print(v)
v[2] = -1
print(v)

A list with data of different types can be used to represent a table:

In [None]:
# name, salary, age, ages of children
tab = [
    ["Peter", 1000.12, 21, [2, 1]],
    ["Helga", 2500.79, 32, [4, 8, 1]],
    ["John", 1500.81, 24, [3, 2]]]

### Nested lists

If a list is an element of a list, its elements are accessed by putting the required index values in separate square brackets:

In [None]:
v = [["one", "two","three"], ["twenty", "thirty", "forty"]]
print(v[1])
print(v[1][0])

In [None]:
v = [["one", "two", "three"], ["twenty", "thirty", "forty"]]
s = v[1][-1] + " " + v[0][1]
print(s)

### Loops and lists

Containers like lists are useful for data processing with loops.

A `for` loop accepts a list as the collection of values for the loop variable.

In [None]:
# Compute average of list elements
acum = 0
cnt = 0
for x in [2.3, 3.4, 4.5, -1.2]:
    acum += x
    cnt += 1
    
avg = acum / cnt
print(f"avg={avg}")

We can also run through a list using an index variable

In [None]:
lst = [2.3, 3.4, 4.5, -1.2]
acum = 0
cnt = 0
for i in range(len(lst)):
    acum += lst[i]
    cnt += 1
    
avg = acum / cnt
print(f"avg={avg}")

This way of processing a list is less idiomatic and usually somewhat slower in Python. The previous one is better: it is the Pythonic way.
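
For this particular task there is an even shorter way: the built-in functions `sum()` and `len()` do the accumulation and the counting for us.

```python
lst = [2.3, 3.4, 4.5, -1.2]
avg = sum(lst) / len(lst)  # total divided by the number of elements
print(f"avg={avg}")
```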

But what if we need the element indexes? Use `enumerate`.

In [None]:
# Using enumerate to get indexes together with elements
for i, x in enumerate([221.23, 432.1, 431.121, 82.2, 991.211]):
    print(f"i={i}, x={x}")

One can iterate over two or more lists at once using `zip`:

In [None]:
x_list = [11, 12, 13]
y_list = ["one", "two", "three"]
for x, y in zip(x_list, y_list):
    print(f"x={x}, y={y}")

Using a table (a list of lists):

In [None]:
tab = [
    ["Peter", 1000.12, 21, [2, 1]],
    ["Helga", 2500.79, 32, [4, 8, 1]],
    ["John", 1500.81, 24, [3, 2]]]

tot = 0
for emp in tab:
    print(f"emp now is equal to {emp}")
    s = emp[1]
    print(f"   take the salary, s={s}")
    itx = s * 0.13
    print(f"   compute income tax, itx={itx}")
    tot += itx
    print(f"   tot={tot}")
    
print(f"Total income tax {tot}")
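
When every row has the same structure, the row can be unpacked into named variables right in the `for` statement, which avoids index expressions like `emp[1]`. A sketch on the same table with the same 13% rate; the variable names are chosen for illustration:

```python
tab = [
    ["Peter", 1000.12, 21, [2, 1]],
    ["Helga", 2500.79, 32, [4, 8, 1]],
    ["John", 1500.81, 24, [3, 2]]]

tot = 0
for name, salary, age, children in tab:
    itx = salary * 0.13  # income tax for this employee
    print(f"{name}: age={age}, children={len(children)}, itx={itx:.2f}")
    tot += itx

print(f"Total income tax {tot:.2f}")
```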

A `while` loop can also be useful for list processing.

An example: assume we have 8 cities connected by roads. The connections are encoded in the list `road`: element `road[i]` contains the number `j` of the city that the road from city `i` leads to. The program must find a path from city `n1` to city `n2`.

In [None]:
n1 = int(input("Start city n1="))
n2 = int(input("Finish city n2="))

road = [4, 1, 3, 7, 6, 0, 2, 5]
STEPS_MAX = 10

i = n1
cnt = 0
while True:
    print(f"  go from city {i} to city {road[i]}")
    i = road[i]
    if i == n2:
        print("finish")
        break
    cnt += 1
    if cnt > STEPS_MAX:
        print(f"path not found after {cnt} steps")
        break
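
The same search can be wrapped into a function that returns the path as a list instead of printing it, which makes it usable without keyboard input. This is only a sketch; the function name `find_path` and its parameters are made up for the illustration.

```python
def find_path(road, start, finish, steps_max=10):
    # Follow the roads from start; give up after steps_max steps
    path = [start]
    city = start
    for _ in range(steps_max):
        city = road[city]
        path.append(city)
        if city == finish:
            return path
    return None  # finish was not reached

road = [4, 1, 3, 7, 6, 0, 2, 5]
print(find_path(road, 0, 3))
print(find_path(road, 0, 1))  # city 1 cannot be reached from city 0
```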
    

### Manipulations with lists

A list can grow and shrink: new elements can be inserted and existing elements can be deleted.

Lists are objects, i.e., instances of the class `list`. Most manipulations with lists are performed using methods.

In [None]:
# Add an element to the end of a list
x = [1, 2, 3, 4]
x.append(99)
print(x)

In [None]:
# Insert an element at an arbitrary location
x = [10, 9, 8, 7]
x.insert(2, 99)
print(x)

In [None]:
# Remove last element and store it in a variable
x = [1, 2, 3, 4, 5]
a = x.pop()
print(x)
print(a)

In [None]:
# Remove an element at a specific position
x = [1, 2, 3, 4, 5]
a = x.pop(3)
print(x)
print(a)

In [None]:
# Remove an element by value
x = [3.2, 4.3, 5.121, 4.3]
x.remove(5.121)
print(x)

In [None]:
# Reverse the order of elements
x = [1, 2, 3, 4, 5]
x.reverse()
print(x)

In [None]:
# Sort a list
x = [54, -1, 99, 21, -11]
x.sort()
print(x)

In [None]:
# Also sorts a list, but returns a new sorted copy and leaves the original unchanged
x = [54, -1, 99, 21, -11]
y = sorted(x)
print(x)
print(y)

In [None]:
# Determine whether or not a value is present in a list
x = [4.3,  55.4, 6.2, 4.3]
print(4.3 in x)
print(4.4 in x)

In [None]:
# Find the position of an element
x = [4.3,  55.4, 6.2, 4.3]
print(x.index(55.4))
print(x.index(55.5))  # Raises ValueError: 55.5 is not in the list
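
Since `.index` stops the program when the value is absent, one possible approach is to test membership with `in` first. This is only a sketch; the loop over two test values is added for illustration.

```python
x = [4.3, 55.4, 6.2, 4.3]

for value in [55.4, 55.5]:
    if value in x:
        print(f"{value} is at position {x.index(value)}")
    else:
        print(f"{value} is not in the list")

# index returns the position of the first occurrence only
first_pos = x.index(4.3)
print(first_pos)
```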

In [None]:
# Concatenate two lists
x = [1, 2, 3]
y = [4, 5, 6]
print(x + y)

In [None]:
# List repetition
x = [1, 2, " "]
print(x * 5)

In [None]:
# list comprehension
x = [1, 2, 3, 4, 5]
y = [i**2 for i in x]
print(y)

In [None]:
# list comprehension with filtering
x = [1, -1, 2, -2, 3, -3]
y = [i**2 for i in x if i > 0]
print(y)

An example: collecting numbers in a list

In [None]:
data = []
while True:
    s = input("Enter a value (blank to quit) ")
    if s == "":
        break
    data.append(float(s))
    
print(data)

### Tuples

Tuples are collections of elements enclosed in parentheses

In [None]:
t = (2, 3.4, "great", [3, 4], (3.0, -3.0))

Tuple elements are accessed in the same way as list elements

In [None]:
t = (0, 10, 20, 30)
print(t[1])
print(t[-2])

The difference: tuples cannot be modified

In [None]:
t = ("one", "big", "pool")
t[1] = "small"  # Not allowed for tuples: raises TypeError

A list with one element is created in a natural way, but this is not the case for tuples.

In [None]:
v = [3]
print(f"v={v}, type={type(v)}")

In [None]:
t = (3)  # Type is int, not a tuple
print(f"t={t}, type={type(t)}")

A tuple with one element needs a trailing comma

In [None]:
t = (3, )  # Now tuple
print(f"t={t}, type={type(t)}")

### Dictionaries

A list is an indexed collection of values. To read or update an element we must know its index.

A dictionary is a collection of named values. Each element is a key-value pair. The key is the name of the value.

In [None]:
# Dictionary definition. Observe curly brackets and colons
emp = {"Helga": 1995, "John": 1992, "Igor": 1999}

# Add a value
emp["July"] = 1990

# Access to elements by their names (keys)
print(emp["Helga"])
print(emp["Igor"])
print(emp["July"])

In [None]:
# Values can be updated
d = {}
d["first_counter"] = 1
d["second_counter"] = 100

d["first_counter"] += 1
d["second_counter"] += 100

print(d["first_counter"], d["second_counter"])

In [None]:
# Remove an element. The removed value is returned
emp = {"Helga": 1995, "John": 1992, "Igor": 1999}
x = emp.pop("Igor")
print(x)

In [None]:
# Iteration over keys
emp = {"Helga": 1995, "John": 1992, "Igor": 1999, "July": 1990}

for x in emp:
    print(x, emp[x])

In [None]:
# Iteration over values
emp = {"Helga": 1995, "John": 1992, "Igor": 1999, "July": 1990}

for x in emp.values():
    print(x)
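
Keys and values can also be obtained together. The sketch below uses the standard dictionary methods `.items()` and `.get()`; the key `"Peter"` and the fallback text are made up for the illustration.

```python
emp = {"Helga": 1995, "John": 1992, "Igor": 1999, "July": 1990}

# Iteration over key-value pairs
for name, year in emp.items():
    print(name, year)

# Reading a value with a fallback for a missing key
missing = emp.get("Peter", "unknown")
print(missing)
```

Unlike `emp["Peter"]`, which stops the program with an error for a missing key, `.get` returns the fallback value.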

### Error handling

The simplest way to fight errors is to check potentially problematic values with the `assert` statement
```python
assert <condition_that_is_expected_to_be_true>, <message_to_show_when_error>
```

The second parameter, the message, can be omitted.

If the condition evaluates to `False`, an `AssertionError` is raised and the computation stops.

In [None]:
while True:
    x = float(input("Never input negative values! x="))
    assert x >= 0.0, "I've told you to avoid negative values!"
    print(f"x={x}")

Observe that the parameters of `assert` must not be enclosed in parentheses. Otherwise Python sees a single tuple `(condition, message)`; a non-empty tuple is always true, so the check never fails.
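
A small sketch of why the parentheses are harmful. It builds the tuple explicitly instead of writing the faulty `assert` and uses the `try-except` construct, which is explained later in this section. The variable names `wrapped` and `failed` are illustrative.

```python
x = -1.0

# The condition and the message wrapped in parentheses form a tuple.
# A non-empty tuple is always truthy, so such an assert would never fail.
wrapped = (x >= 0.0, "negative value")
print(bool(wrapped))

# Without the parentheses the condition is really checked
try:
    assert x >= 0.0, "negative value"
    failed = False
except AssertionError as err:
    failed = True
    print(f"AssertionError: {err}")
```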

A more powerful way of handling errors is the `try-except` construct.

Each incorrect operation raises a so-called exception that causes abnormal program termination. By the way, `assert` also raises an exception to stop a program.

There is a way to handle exceptions. If an exception occurs within the try-section of a `try-except` construct, the execution jumps to its except-section, where we can fix the problem and continue normal execution.

```python
try:
    <some_code_where_exception_can_occur>
except <exception_name>:
    <do_something_to_process_error_caused_exception>
except <another_exception_name>:
    <process_another_error>
```

Consider a program with an exception:

In [None]:
# Program stops when zero is encountered
div = [1.1, 2.1, 0.0, 3.2]
for d in div:
    x = 10 / d
    print(f"10 / {d} = {x}")

The error message shows the exception name: `ZeroDivisionError`. Use it to create the exception handler.

In [None]:
# Now program runs to the end
div = [1.1, 2.1, 0.0, 3.2]
for d in div:
    try:
        x = 10 / d
    except ZeroDivisionError:
        x = float("inf")
    print(f"10 / {d} = {x}")
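
The template above allows several except-sections. A minimal sketch with two of them, assuming the input arrives as strings (the list `raw_values` is made up for this illustration):

```python
raw_values = ["2.0", "0", "abc", "4"]
results = []

for raw in raw_values:
    try:
        d = float(raw)   # may raise ValueError
        x = 10 / d       # may raise ZeroDivisionError
    except ValueError:
        print(f"'{raw}' is not a number, skipped")
        continue
    except ZeroDivisionError:
        x = float("inf")
    results.append(x)

print(results)
```

Each error gets its own reaction: a string that is not a number is skipped, while division by zero is replaced with infinity, as in the example above.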

## Lesson 4

### Arrays

A list whose elements all have the same type corresponds to an array in other programming languages

In [None]:
v1 = [1, 2, -5, 5, 6]  # Array of integers
v2 = [3.2, -1.2, 1.3e-4, 2.33]  # Array of real numbers
v3 = ["big", "small", "left", "right"]  # Array of strings

v4 = ["left", 12, True]  # Just a list, but not an array

These arrays are one-dimensional. One index is required to access their elements

In [None]:
v = [1, 2, -5, 5, 6]
print(v[3])
print(v[-1])

Recall: the size of an array is obtained with the function `len`

In [None]:
v = [45, 32, 56, 90]
print(len(v))

The shape of a one-dimensional array is a one-element tuple containing its size. The shape of `v` is `(4,)`

### Multidimensional arrays

A list of arrays of the same type and the same size is a two-dimensional array.

In [None]:
d1 = [[10, 20, 30], [40, 50, 60], [70, 80, 90], [100, 110, 120]]

The list of lists `d1` above is a two-dimensional array:

- it contains 4 elements
- each element is a list of 3 elements
- all elements are `int`

Size of the first dimension of `d1` is 4, size of the second dimension is 3.

The shape of a two-dimensional array contains two elements: the sizes of the first and the second dimensions. For `d1` the shape is `(4, 3)`.

Recall: elements are accessed via indexes, each enclosed in its own pair of square brackets

In [None]:
d1 = [[10, 20, 30], [40, 50, 60], [70, 80, 90], [100, 110, 120]]
print(d1[0])
print(d1[0][1])

If we replace each number in `d1` above with a list (all of the same size), we obtain a three-dimensional array. Its shape contains three elements.
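
The notion of shape for nested lists can be sketched as a small function. The name `nested_shape` and the array `d3` are hypothetical; the function assumes that the lists are non-empty and that all sublists on the same level have equal sizes.

```python
def nested_shape(arr):
    """Return the shape of a regular nested list as a tuple.
    Assumes non-empty lists with equal sizes on each level."""
    shape = []
    while isinstance(arr, list):
        shape.append(len(arr))  # size of the current dimension
        arr = arr[0]            # go one level deeper
    return tuple(shape)

v = [45, 32, 56, 90]
d1 = [[10, 20, 30], [40, 50, 60], [70, 80, 90], [100, 110, 120]]
d3 = [[[0, 1], [2, 3], [4, 5]], [[6, 7], [8, 9], [10, 11]]]

print(nested_shape(v))
print(nested_shape(d1))
print(nested_shape(d3))
```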

Working with multidimensional arrays created as lists of lists of lists and so on is not very convenient.

### Arrays in NumPy

NumPy is a Python package. Its name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.

NumPy provides a powerful set of tools for working with arrays, including multidimensional ones.

Arrays can be created from a list using the function `numpy.array`

In [None]:
import numpy as np 
a = np.array([1, 2, 3]) 
print(a)

NumPy array is an object, i.e., an instance of a class. The class name is `numpy.ndarray`.

In [None]:
import numpy as np 
a = np.array([1, 2, 3]) 
print(type(a))

Different ways to create a NumPy array

In [None]:
import numpy as np 
a = np.arange(12)  # integers from 0 up to, but not including, 12
print(f"np.arange -> {a}")

In [None]:
a = np.arange(1, 13)  # integers from 1 up to, but not including, 13
print(f"np.arange -> {a}")

In [None]:
a = np.arange(1, 13, 3)  # from 1 up to, but not including, 13 with step 3
print(f"np.arange -> {a}")

In [None]:
b = np.linspace(5, 10, 5)  # 5 evenly spaced reals from 5 to 10, both ends included
print(f"np.linspace -> {b}")
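
A short sketch of how the spacing of `np.linspace` follows from its arguments. The names `start`, `stop`, `num` and `step` are introduced here for illustration.

```python
import numpy as np

start, stop, num = 5, 10, 5
b = np.linspace(start, stop, num)

# With both ends included there are num - 1 equal intervals
step = (stop - start) / (num - 1)
print(f"step = {step}")
print(b)

# The same values built from np.arange and the computed step
b2 = start + step * np.arange(num)
print(np.allclose(b, b2))
```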

In [None]:
c = np.zeros(10)
print(f"np.zeros -> {c}")  # given number of zeros

In [None]:
c = np.ones(10)
print(f"np.ones -> {c}")  # given number of ones

In [None]:
c = np.full(10, 999)
print(f"np.full -> {c}")  # given number of given values

### Attributes of a NumPy array

Being an object, a NumPy array provides a lot of information about itself through its attributes.
- `.shape` Shape of the array
- `.dtype` Type of array elements
- `.ndim` Number of array dimensions
- `.size` Total number of elements

In [None]:
import numpy as np 
a = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
print(f"Number of dimensions {a.ndim}, shape {a.shape}, number of elements {a.size}, type of elements {a.dtype}")

### Reshaping NumPy arrays

In some cases an array needs to be reshaped.

Reshaping means splitting the same data into dimensions in a different way. The total number of elements stays the same.

The method `.reshape` is used for this

In [None]:
import numpy as np 

# two-dimensional array (4, 3)
a = np.array([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
print(a.shape)

In [None]:
# reshaping to (2, 6)
b = a.reshape(2,6)
print(b.shape)
print(b)

In [None]:
# now reshape it to a flat array (one-dimensional)
c = a.reshape(12)
print(c.shape)
print(c)

When reshaping, one dimension can be replaced with `-1`. Its size is then computed from the total number of elements and the other dimensions

In [None]:
import numpy as np 
a = np.arange(24)  # flat array (one-dimensional)
print(a)
print(a.shape)

In [None]:
# now reshape it to two-dimensional array
# the first dimension must be 8, the second one is computed
b = a.reshape(8, -1)
print(b)
print(b.shape)
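
Two more cases, sketched under the same setup: `-1` in the first position, and a single `-1` that flattens the array. The last part shows that a shape that does not fit the number of elements is rejected; the variable `fits` is illustrative.

```python
import numpy as np

a = np.arange(24)

b = a.reshape(-1, 6)   # the first dimension is computed: 24 / 6 = 4
c = b.reshape(-1)      # a single -1 flattens the array back
print(b.shape, c.shape)

# A shape that does not fit 24 elements is rejected
try:
    a.reshape(5, -1)
    fits = True
except ValueError as err:
    fits = False
    print(f"ValueError: {err}")
```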

### Getting and updating NumPy array elements

Elements of a NumPy array are selected using zero-based indexes. For one-dimensional arrays everything works the same as for lists.

In [None]:
import numpy as np 
a = np.arange(10)
print(a)
print(a[3])

In [None]:
a[3] = -3
print(a)

In [None]:
b = a[2:5]
print(b)

For multidimensional arrays all indexes are enclosed in a single pair of square brackets and separated by commas. (Recall that for nested lists we enclose each index in its own square brackets).

In [None]:
import numpy as np
a = np.arange(12).reshape(3, 4)
print(a)

In [None]:
print(a[0, 0], a[0, 1], a[0, 2], a[0, 3])
print(a[2, 0], a[2, 1], a[2, 2], a[2, 3])
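
Slices can be used in place of any index, which gives whole rows, columns or blocks at once. A minimal sketch; the names `row`, `col` and `block` are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row = a[2, :]        # the whole row with index 2
col = a[:, 1]        # the whole column with index 1
block = a[0:2, 1:3]  # rows 0-1 and columns 1-2

print(row)
print(col)
print(block)
```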

One value can be assigned to multiple elements of a NumPy array at once

In [None]:
import numpy as np
a = np.arange(12)
print(a)
a[3:-3] = 100
print(a)

### Arithmetic of NumPy arrays

Arrays support elementwise operations using the arithmetic operators `+`, `-`, `*`, `/`.

An array and a number

In [None]:
import numpy as np
x = np.array([2.0, 3.0, 4.0, 5.0])

In [None]:
y = x + 5.0
print(y)

In [None]:
y = x * 10.0
print(y)

Two arrays

In [None]:
import numpy as np
x1 = np.array([2.0, 3.0, 4.0, 5.0])
x2 = np.array([6.0, 7.0, 8.0, 9.0])

In [None]:
y = x1 + x2
print(y)

In [None]:
y = x1 * x2
print(y)

For elementwise operations on two arrays like these, the arrays must have identical sizes

In [None]:
import numpy as np
x1 = np.array([2.0, 3.0, 4.0])
x2 = np.array([6.0, 7.0, 8.0, 9.0])
y = x1 + x2  # Error: the arrays have different sizes (3 and 4)

### Functions and NumPy arrays

Math functions are also applied elementwise

In [None]:
import numpy as np

x = np.linspace(0, 12, 5)
y = x**2
print(x)
print(y)

In [None]:
x = np.linspace(0, 1, 5)
y = np.sin(x * np.pi)
print(x)
print(y)

### Reducing NumPy arrays

There are functions that aggregate an array into a single value. These functions are also called reducing functions

In [None]:
import numpy as np

x1 = np.array([2.0, 3.0, 4.0])
a = x1.sum()
print(f"sum -> {a}")

In [None]:
x1 = np.array([2.0, 3.0, 4.0])
a = x1.prod()
print(f"prod -> {a}")

In [None]:
x1 = np.array([2.0, 3.0, 4.0])
a = x1.mean()
print(f"mean -> {a}")

In [None]:
x1 = np.array([2.0, 3.0, 4.0])
a = x1.max()
print(f"max -> {a}")

In [None]:
x1 = np.array([2.0, 3.0, 4.0])
a = x1.min()
print(f"min -> {a}")
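
For a multidimensional array the same methods accept the `axis` parameter, which selects the dimension to reduce over. The following is a sketch for a two-dimensional array; the variable names are illustrative.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

total = a.sum()             # all elements
col_sums = a.sum(axis=0)    # sum over rows: one value per column
row_means = a.mean(axis=1)  # mean over columns: one value per row

print(total)
print(col_sums)
print(row_means)
```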

### Random numbers in a nutshell

If we toss a coin it lands with either heads or tails facing upwards. We write 0 for heads and 1 for tails (or vice versa). After many repetitions we obtain a random sequence of 0s and 1s.

We can model it as follows:

In [None]:
import numpy as np

rng = np.random.default_rng()  # we create a random number generator
rand_seq = rng.integers(2, size=25)  # the generator produces an array of 25 random 0 or 1
print(f"rand_seq={rand_seq}")

Observe that re-running this code results in different sequences. 

In [None]:
rand_seq = rng.integers(2, size=25)
print(f"rand_seq={rand_seq}")
rand_seq = rng.integers(2, size=25)
print(f"rand_seq={rand_seq}")
rand_seq = rng.integers(2, size=25)
print(f"rand_seq={rand_seq}")

We can also generate random floats. The numbers are generated in the half-open range `[0, 1)`: 0 is included and 1 is not

In [None]:
rand_flt = rng.random(size=10)  # 10 random floats
print(f"rand_flt={rand_flt}")
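
When a reproducible sequence is needed, for example for debugging, the generator can be created with a seed. This is a minimal sketch; the value of `SEED` is arbitrary and chosen only for this illustration.

```python
import numpy as np

SEED = 12345  # arbitrary value chosen for this illustration

# Two generators created with the same seed
rng1 = np.random.default_rng(SEED)
rng2 = np.random.default_rng(SEED)

seq1 = rng1.integers(2, size=25)
seq2 = rng2.integers(2, size=25)

print(seq1)
print(seq2)
print((seq1 == seq2).all())  # the sequences coincide
```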

### Files

Results of computations can be saved to files. Before use, a file must be opened with the function `open()`. Its two main parameters are the file name and the mode. The mode can be reading or writing (there are also some others). This function returns an object representing the file.

- Writing data: `mode='w'`, method `.write()`
- Reading data: `mode='r'`, method `.read()`

After use, the file needs to be closed with the method `.close()`. Otherwise some data may be lost.

When writing a file, one must remember to end lines at the appropriate positions. The symbol `\n` is used for this.

In [None]:
# Create the simplest file
fn = "my_file1.txt"
fdat = open(fn, "w")
fdat.write("O tempora!\nO mores!\n")
fdat.close()

In [None]:
# Download the file from Colab to see it or just go to it and open
try:
    from google.colab import files
    files.download(fn)
except ModuleNotFoundError:
    import os
    print(f"You are not in Colab. Just locate your file at\n{os.path.join(os.getcwd(), fn)}")

In [None]:
# Better way of working with files: the with statement never forgets to close the file
fn = "my_file2.txt"
with open(fn, 'w') as fdat:
    fdat.write("Veni,\nvidi,\nvici\n")

In [None]:
# Download the file from Colab or just go to it
try:
    from google.colab import files
    files.download(fn)
except ModuleNotFoundError:    
    import os
    print(f"You are not in Colab. Just locate your file at\n{os.path.join(os.getcwd(), fn)}")

In [None]:
# Write a file in a loop
fn = "my_file3.txt"
with open(fn, 'w') as fdat:
    for i in range(10):
        fdat.write(f"i = {i}, 2**i = {2**i}\n")

In [None]:
# Download the file from Colab or just go to it
try:
    from google.colab import files
    files.download(fn)
except ModuleNotFoundError:    
    import os
    print(f"You are not in Colab. Just locate your file at\n{os.path.join(os.getcwd(), fn)}")

In [None]:
# Example of reading file. First write it
fn = "my_file4.txt"
txt = \
"""When, in disgrace with fortune and men's eyes,
I all alone beweep my outcast state,
And trouble deaf heaven with my bootless cries,
And look upon myself, and curse my fate,
Wishing me like to one more rich in hope,
Featured like him, like him with friends possessed,
Desiring this man's art and that man's scope,
With what I most enjoy contented least;
Yet in these thoughts myself almost despising,
Haply I think on thee - and then my state,
Like to the lark at break of day arising
From sullen earth, sings hymns at heaven's gate;
For thy sweet love rememb'red such wealth brings
That then I scorn to change my state with kings."""
with open(fn, "w") as fdat:
    fdat.write(txt)

In [None]:
# Read it all at once
fn = "my_file4.txt"
with open(fn, "r") as fdat:
    s = fdat.read()
    
print(s)

If a file is large it is better to read it line by line. For printing we take only the first 10 symbols of each line.

In [None]:
# Here we read a file line by line
fn = "my_file4.txt"
CUT = 10
with open(fn, "r") as fdat:
    for line in fdat:
        s = line[:CUT]
        print(s)

We can enumerate lines using `enumerate`.

In [None]:
# Here we enumerate lines, compute lengths and print first 10 symbols
fn = "my_file4.txt"
CUT = 10
with open(fn, "r") as fdat:
    for i, line in enumerate(fdat):
        n = len(line)
        s = line[:CUT]
        print(f"{i+1:3}: {s}..., len={n}")

In [None]:
# Here we compute number of 'a' in each line
fn = "my_file4.txt"
SYMB = "a"
tot = 0
with open(fn, "r") as fdat:
    for i, line in enumerate(fdat):
        cnt = 0
        for s in line:
            if s == SYMB:
                cnt += 1
            
        tot += cnt
        print(f"Number of '{SYMB}' in line {i+1:3} is {cnt:3}")

print(f"Total number of '{SYMB}' is {tot}")
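
The inner loop over characters can be replaced with the string method `.count`. The sketch below writes its own small temporary file to stay self-contained; the list `lines` is made up for this illustration.

```python
import os
import tempfile

SYMB = "a"
lines = ["O tempora!\n", "O mores!\n", "Veni, vidi, vici\n"]

# Write a small temporary file so that the example is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as ftmp:
    ftmp.writelines(lines)
    fn = ftmp.name

tot = 0
with open(fn, "r") as fdat:
    for i, line in enumerate(fdat):
        cnt = line.count(SYMB)  # replaces the inner loop over characters
        tot += cnt
        print(f"Number of '{SYMB}' in line {i+1:3} is {cnt:3}")

print(f"Total number of '{SYMB}' is {tot}")
os.remove(fn)  # remove the temporary file
```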

### CSV files

CSV is a simple file format used to store tabular data. Files in the CSV format can be imported to and exported from programs that store data in tables, such as Microsoft Excel or OpenOffice Calc. CSV stands for "comma-separated values".


In [None]:
# Run this cell to have a csv file in your working directory
file_name = "areas.csv"

# 20 largest countries: name (str) and area in km squared (float)
txt = ["Iran, 1648000","Libya, 1759540","Indonesia, 1919440","Saudi_Arabia, 1960582","Mexico, 1972550","Greenland, 2166086","Congo_Democratic_Republic_of_the, 2345410","Algeria, 2381740","Sudan, 2505810","Kazakhstan, 2717300","Argentina, 2766890","India, 3287590","#European_Union, 3976372","Australia, 7686850","Brazil, 8511965","China, 9596960","United_States, 9631418","Canada, 9984670","Antarctica, 14000000","Russia, 17075200"]
with open(file_name, 'w') as f:
    for c in txt:
        f.write(c + '\n')

How do we load a CSV file? Although ready-made modules are available, we are going to do it by hand.

First let us get familiar with splitting a string by a delimiter. Strings have the method `.split` for this.

In [None]:
s = "one, two, three"
v = s.split(',')
print(v)

Observe the spaces before "two" and "three". We use a list comprehension and the string method `.strip` to remove them

In [None]:
v1 = [s.strip() for s in v]
print(v1)

Now we are ready to load CSV as a list

In [None]:
# First step: see what we read from file
file_name = "areas.csv"

with open(file_name, 'r') as f:
    for line in f:
        print(line, end="") # line is a string corresponding to a whole line in the file        

In [None]:
# Split each line by comma
file_name = "areas.csv"

with open(file_name, 'r') as f:
    for line in f:
        s = line.split(',')
        print(s)

In [None]:
# Clean up with strip. Observe that strip removes both spaces and newlines
file_name = "areas.csv"

with open(file_name, 'r') as f:
    for line in f:
        s = line.split(',')
        s = [c.strip() for c in s]
        print(s)

In [None]:
# Convert areas to float
file_name = "areas.csv"

with open(file_name, 'r') as f:
    for line in f:
        s = line.split(',')
        s = [c.strip() for c in s]
        s[1] = float(s[1])
        print(s)

In [None]:
# Collect all to a list
file_name = "areas.csv"

with open(file_name, 'r') as f:
    dat = []
    for line in f:
        s = line.split(',')
        s = [c.strip() for c in s]
        s[1] = float(s[1])
        dat.append(s)

print(dat)

The list of name-area pairs can be converted to a dictionary

In [None]:
ddat = dict(dat)
print(ddat)
print(ddat['Russia'])
print(ddat['Iran'])
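
For comparison, here is a sketch of the same loading done with the standard module `csv`. To stay self-contained it reads a few lines from memory via `io.StringIO` instead of the file; the string `sample` is an excerpt made up for this illustration.

```python
import csv
import io

# A few lines in the same format as areas.csv, kept in memory for brevity
sample = "Iran, 1648000\nLibya, 1759540\nIndonesia, 1919440\n"

dat = []
# skipinitialspace=True drops the space that follows each comma
for name, area in csv.reader(io.StringIO(sample), skipinitialspace=True):
    dat.append([name, float(area)])

ddat = dict(dat)
print(ddat)
```

With a real file, the object returned by `open(file_name, 'r')` would be passed to `csv.reader` in place of `io.StringIO(sample)`.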

### Clear temporary files

In [None]:
import os

temp_files = ["my_file1.txt", "my_file2.txt", "my_file3.txt", "my_file4.txt", 
              "areas.csv", "presidents.csv", "sleep.csv"]

ask = input("Do you really want to remove temporary files? (y/n)")
if ask.strip().lower().startswith('y'):  # also safe when the answer is empty
    flag = False
    for file in temp_files:
        try:
            os.remove(file)
            print(f"Removed {file}")
            flag = True
        except FileNotFoundError:
            pass
    if not flag:
        print("No files to remove")

## Lesson 5

### Native Python string find and replace methods

Assume that you need to find some word in a text. 

Maybe also you need to replace it with another word.

Python strings provide simple tools for it.

The following example shows how to find a substring.

In [None]:
txt = """The English Wikipedia was the first Wikipedia edition and has
remained the lagest."""

Method `.find(sub, start, end)` searches for the first occurrence of the substring `sub` between the positions `start` and `end` (by default, in the whole string). It returns the position of the substring, or -1 if it is not found.

In [None]:
pos = txt.find("Wiki")
print(pos)
print(txt[pos:])

To find the next occurrence we can start searching from `pos + 1`

In [None]:
pos2 = txt.find("Wiki", pos + 1)
print(pos2)
print(txt[pos2:])

There is a version of the search method that finds the last occurrence of the substring.

In [None]:
pos = txt.rfind("the")
print(pos)
print(txt[pos:])

To find the occurrence before that one, we restrict the search to the part of the string before `pos`:

In [None]:
pos2 = txt.rfind("the", 0, pos)
print(pos2)
print(txt[pos2:])

Replacing a substring is done with the method `.replace(old, new, count=-1)`.

Let us fix a typo in our string. The method returns a new string; `txt` itself is not changed:

In [None]:
txt.replace("lagest", "largest")
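
A small sketch of two details of `.replace`: the result must be stored in a variable because strings cannot be modified in place, and the third argument limits the number of replacements. The test string is made up for this illustration.

```python
txt = "one two one two one"

fixed = txt.replace("one", "1")          # replace all occurrences
first_only = txt.replace("one", "1", 1)  # replace only the first one

print(txt)         # the original string is unchanged
print(fixed)
print(first_only)
```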

### Regular expressions

Regular expressions extend the possibilities provided by the find and replace methods.

Regular expressions are sequences of characters that describe search patterns for find-and-replace operations.

The power of regular expressions is that they allow us to find whole sets of similar substrings.

For example, the symbol `\w` corresponds to any alphanumeric character, and `\d` matches any digit.

Working with regular expressions in Python is done via the standard module `re`.

In addition to the regular expression support, this module comes with powerful tools for finding and replacing.

- `re.match()`
- `re.search()`
- `re.findall()`
- `re.split()`
- `re.sub()`
- `re.compile()`

`Match` is an object containing information about the search and its result.

Some functions of the module `re` return the result as a `Match` object; others return a plain list of strings.

Function `re.match(pattern, string)` looks for the pattern only at the beginning of a string.

It returns a `Match` object, or `None` if the beginning of the string does not match.

In [None]:
import re

txt = "When in Rome, do as the Romans"

mtch = re.match(r"When", txt)
print(mtch)

`Match` has a method `.group()` that gives the found pattern:

In [None]:
print("Found pattern:", mtch.group())

The name of the method `.group()` may seem unclear at this point.

When the pattern is plain text, as above, there is always exactly one group.

But when we specify the pattern as a regular expression, we may want not only to find it but also to dissect the string into several parts that match different components of interest.

The found parts are returned by `Match` as groups.

Examples of a nontrivial use of groups are given below.

`Match` can also get positions of the beginning and the end of the pattern, as well as its span:

In [None]:
print(mtch.start(), mtch.end(), mtch.span())

If we try to find another word, the search fails: the function `re.match()` checks only the beginning of the string.

In [None]:
mtch = re.match(r"Rome", txt)
print(mtch)
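
Because a failed search gives `None`, calling `.group()` on the result directly would stop the program with an error. A minimal sketch of checking the result first; the list `found` is illustrative.

```python
import re

txt = "When in Rome, do as the Romans"

found = []
for word in ["When", "Rome"]:
    mtch = re.match(word, txt)
    if mtch is None:
        print(f"'{word}' is not at the beginning of the string")
    else:
        found.append(mtch.group())
        print(f"'{word}' found at {mtch.span()}")
```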

Observe that patterns should be written as raw strings, with the `r` prefix

```python
r"Hello\n Good bye"
```

This prevents Python from treating the backslash `\` as the start of a special symbol such as `\n`.

Let us check:

In [None]:
s1 = "\nHello\nGood bye"
s2 = r"\nHello\nGood bye"
print("Normal string:", s1)
print()
print("Raw string:", s2)

Function `re.search(pattern, string)` searches the whole string and returns a `Match` object for the first occurrence of the pattern.

In [None]:
import re

txt = "Hope for the best, but prepare for the worst"

mtch = re.search(r"for", txt)
print(mtch.group(), mtch.span())

Function `re.findall(pattern, string)` returns a list of all occurrences of the pattern.

Observe that this function returns a list, not an object `Match`.

In [None]:
import re

txt = "Keep your friends close and your enemies closer"

lst = re.findall(r"close", txt)
print(lst)

Function `re.split(pattern, string, maxsplit=0)` splits a string by the pattern. If `maxsplit` is zero (the default), the string is split at every occurrence of the pattern. Otherwise at most `maxsplit` splits are made.

In [None]:
import re

txt = "One man's trash is another man's treasure"

spl = re.split(r"man", txt)
print(spl)

spl = re.split(r" ", txt)
print(spl)

spl = re.split(r" ", txt, 3)
print(spl)
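
The advantage of `re.split` over the string method `.split` shows up when the delimiter varies. The sketch below uses the pattern symbols `\s`, `*` and `[..]`, which are explained later in this lesson; the test string is made up for this illustration.

```python
import re

txt = "one,two;  three ,four"

# Delimiter: optional spaces, then a comma or a semicolon, then optional spaces
parts = re.split(r"\s*[,;]\s*", txt)
print(parts)
```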

If the same pattern is used for many searches, it is recommended to compile it first: `re.compile(pattern)`.

The returned value is a compiled pattern object (class `re.Pattern`).

It has its own search methods.

In [None]:
import re

txt1 = "If you can't beat them, join them"
txt2 = "You can't judge a book by its cover"
txt3 = "You can lead a horse to water, but you can't make him drink"

rge = re.compile(r"can")
print(rge.findall(txt1))
print(rge.findall(txt2))
print(rge.findall(txt3))

In what follows we will always consider the compiled patterns.

Let us now discuss the regular expressions. 

Regular expressions are built from special symbols, each matching one or many different characters.

- `\w` : Matches an alphanumeric character (a letter, a digit or an underscore)
- `\d` : Matches a digit \[0-9\]
- `\s` : Matches a single whitespace character (space, newline, tab)

An example is below. 

Observe that each of these symbols matches with only one character.

Also notice that the exclamation point is not matched at all.

In [None]:
import re 

txt = "Agent 007!"

rge_w = re.compile(r"\w")
rge_d = re.compile(r"\d")
rge_s = re.compile(r"\s")

print(rge_w.findall(txt))
print(rge_d.findall(txt))
print(rge_s.findall(txt))

Capital-letter versions of these patterns mean the negation:

- `\W` : Matches any character that is not alphanumeric
- `\D` : Matches any character that is not a digit \[0-9\]
- `\S` : Matches any character that is not whitespace (space, newline, tab)

In [None]:
import re 

txt = "Agent 007!"

rge_W = re.compile(r"\W")
rge_D = re.compile(r"\D")
rge_S = re.compile(r"\S")

print(rge_W.findall(txt))
print(rge_D.findall(txt))
print(rge_S.findall(txt))

The pattern `\W` detects all non alphanumerical symbols: space and the exclamation point

The pattern `\D` returns all non digits: these are letters, space and the exclamation point.

Finally `\S` finds all non space symbols.

We can specify particular characters that we want to match:

- `[..]` : Matches with any single character in square brackets
- `[^..]` : The negation: matches with any single character not in square brackets

In [None]:
import re 

txt = "experimentalist"

rge_c = re.compile(r"[aei]")
rge_C = re.compile(r"[^aei]")

print(rge_c.findall(txt))
print(rge_C.findall(txt))

Square brackets admit range specification via `-` (minus) sign

- `[a-d]` : Matches characters from a to d
- `[a-zA-Z]` : Matches all Latin letters

Observe that `\w` matches both letters and digits, while the range `[a-zA-Z]` matches only letters.

In [None]:
import re 

txt = "Agent 007!"

rge_w = re.compile(r"\w")
rge_l = re.compile(r"[a-zA-Z]")

print(rge_w.findall(txt))
print(rge_l.findall(txt))

Finally any single character except new line is matched like this:

- `.` (period) : Matches any single character except newline
- `\n` : Matches newline symbols

In the example below the string `txt1` is a raw string, so Python treats the `\n` in the middle literally, as a backslash followed by the character `n`.

`txt2` is a plain string, where Python treats `\n` as a newline symbol.

Observe how the period pattern processes these strings.

It matches all symbols of the first string since there are no newline symbols in it.

And it skips the newline symbol in the second string.

In [None]:
import re 

txt1 = r"Two roads diverged in a yellow wood,\nAnd sorry I could not travel both"
txt2 = "Two roads diverged in a yellow wood,\nAnd sorry I could not travel both"

rge_p = re.compile(r".")

print(rge_p.findall(txt1))
print(rge_p.findall(txt2))

Accordingly, if we search for newline symbols, we find one only in the second string:

In [None]:
rge_n = re.compile(r"\n")

print(rge_n.findall(txt1))
print(rge_n.findall(txt2))

The single pattern symbols can be combined together and with plain characters:

In [None]:
import re 

txt = """The longest recorded rated chess game in history: 
Ivan Nikolic vs. Goran Arsovic, 17 Feb 1989. 
1. d4 Nf6 2. c4 g6 3. Nc3 Bg7 4. e4 d6 5. Nf3 O-O 6. Be2 Nbd7
etc
"""

rge1 = re.compile(r"[a-zA-Z]\d")
rge2 = re.compile(r"\d\d \w\w\w \d\d\d\d")

print(rge1.findall(txt))
print(rge2.findall(txt))

In the above example we have found a date by repeating `\d` and `\w`.

In general, repeating the pattern symbols is not very convenient.

To match repeated characters we can use special symbols:

- `?` : Matches 0 or 1 occurrence of the pattern to its left
- `+` : Matches 1 or more occurrences of the pattern to its left
- `*` : Matches 0 or more occurrences of the pattern to its left
- `{n,m}` : Matches at least n and at most m occurrences of preceding expression. 
- `{,m}` : Matches at most m occurrences of preceding expression. Zero occurrences are also matched.
- `{n,}` : Matches at least n or more occurrences of preceding expression.
- `{n}` : Matches exactly n occurrences of preceding expression.
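
The repetition symbols listed above can be compared on one test string. This is only a sketch; the string `txt` and the dictionary `results` are made up for the illustration.

```python
import re

txt = "a ab abb abbb abbbb"

patterns = [r"ab?", r"ab+", r"ab*", r"ab{2,3}", r"ab{,2}", r"ab{3,}", r"ab{2}"]

results = {}
for pattern in patterns:
    # In each pattern the repetition symbol applies to the letter b only
    results[pattern] = re.findall(pattern, txt)
    print(f"{pattern:10} -> {results[pattern]}")
```

Observe that the repetition symbols are greedy: they take as many characters as the limits allow.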

So another version of a pattern to extract the date above is as follows:

In [None]:
rge3 = re.compile(r"\d+ [a-zA-Z]+ \d+")

print(rge3.findall(txt))

More exact specification:

In [None]:
rge3 = re.compile(r"\d{2} [a-zA-Z]{3} \d{4}")

print(rge3.findall(txt))

The following pattern symbols match the start and the end of a string:

- `^` : Matches the start of the string.
- `$` : Matches the end of the string.

The example below uses the pattern `\w+` to match all words separated by space symbols.

The pattern reads as follows: "Find every run of one or more (`+`) alphanumeric symbols (`\w`) in a row"

In [None]:
import re

txt = """I've a cat named Vesters,
And he eats all day.
He always lays around,
And never wants to play.

Not even with a squeaky toy, 
Nor anything that moves.
When I have him exercise,
He always disapproves.

So we've put him on a diet,
But now he yells all day.
And even though he's thinner,
He still won't come and play.
"""

rge1 = re.compile(r'\w+')
print(rge1.findall(txt))

Now let us try to find all words at line starts by adding the `^` symbol before the pattern.

In [None]:
rge2 = re.compile(r'^\w+')
print(rge2.findall(txt))

Only the very first word is found: by default `^` matches only at the beginning of the whole string.

If we want to find all words at line beginnings after each line break we need to switch the search to
multiline mode:

In [None]:
rge3 = re.compile(r'^\w+', re.MULTILINE)
print(rge3.findall(txt))

Complex patterns can be combined with the logical operator Or:

- `a|b` : Matches either a or b

Assume that we have a text with dates in different formats. 

The following pattern will extract all of them:

In [None]:
import re

txt = """Writers have traditionally written abbreviated dates according 
to their local custom, creating all-numeric equivalents to dates such as, 
"15 February 2021" (15/02/21, 15/02/2021, 15-02-2021 or 15.02.2021)
"""

rge = re.compile(r"\d{2}\s\w+\s\d{4}|\d{2}[/-]\d{2}[/-]\d+|\d{2}\.\d{2}\.\d+")

print(rge.findall(txt))

Here there are three patterns combined with logical or `|`:

- `\d{2}\s\w+\s\d{4}` : Two digits, space, a word, four digits
- `\d{2}[/-]\d{2}[/-]\d+` : Two digits, slash or minus, two digits, slash or minus, one or more digits (need this to match both 21 and 2021)
- `\d{2}\.\d{2}\.\d+` : Two digits, period protected by a backslash (to treat it as a character and not as patterns symbol), two digits, protected period, one or more digits

As we have seen special symbols like `.`, `?` or `*` can be used as plain characters when protected by backslash:

- `\.` `\?` `\+` `\*` : Match special symbols as plain characters.
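
When a whole piece of plain text has to be matched literally, the protection can be added automatically with the function `re.escape` from the same module. A minimal sketch; the strings `txt` and `literal` are made up for this illustration.

```python
import re

txt = "Is 2+2=4? Yes, 2+2=4."

literal = "2+2=4?"
rge = re.compile(re.escape(literal))  # every special symbol is protected
print(rge.pattern)
print(rge.findall(txt))

# Without escaping, + and ? act as repetition symbols and nothing is found
print(re.findall(literal, txt))
```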

Below is another illustration of pattern search.

The pattern `\w+,?\s\w+` matches the following sequences:

- `\w+` : one or more alphanumeric characters, which actually matches a word
- `,?` : one comma or no comma
- `\s` : a whitespace character (here a space or a newline)
- `\w+` : again one or more alphanumeric characters, i.e., a word again

This pattern extracts pairs of successive words from the string:

In [None]:
import re

txt = """Mary had a little lamb,
Little lamb, little lamb,
Mary had a little lamb
Whose fleece was white as snow.
"""

rge = re.compile(r"\w+,?\s\w+")
print(rge.findall(txt))

The highlighting below helps to clarify what was found: successive matches are shown alternately in boldface and in regular type.

**Mary had** a little **lamb,
Little** lamb, little **lamb,
Mary** had a **little lamb**
Whose fleece **was white** as snow.


Round brackets do exactly what we think they should do - they group patterns:

- `(`, `)` : Create a group of pattern symbols

Here is an example of using round brackets.

The pattern is almost the same as before, but the words are grouped by brackets.

Observe that the matched words are now extracted separately, and the spaces, commas and newlines between them are omitted:

In [None]:
rge = re.compile(r"(\w+),?\s(\w+)")

print(rge.findall(txt))

And in this pattern the middle spaces, commas and newlines are grouped instead:

In [None]:
rge = re.compile(r"\w+(,?\s)\w+")

print(rge.findall(txt))
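
The shape of the result returned by `findall` depends on the number of groups in the pattern. A minimal sketch on a short string; the name `sample` and the result names are introduced here for illustration:

```python
import re

sample = "one, two three four"

# No groups: a list of whole matches
no_groups = re.findall(r"\w+,?\s\w+", sample)

# One group: a list of the strings matched by that group
one_group = re.findall(r"\w+(,?\s)\w+", sample)

# Two groups: a list of tuples, one item per group
two_groups = re.findall(r"(\w+),?\s(\w+)", sample)

print(no_groups)
print(one_group)
print(two_groups)
```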

In this example we use regular expressions to extract names of English kings.

The pattern `\w+\s+\w+\s+[IV]{1,3}` includes the following parts:

- `\w+` : a word
- `\s+` : one or more spaces or newlines in a row
- `\w+` : a word again
- `\s+` : again spaces and/or newlines
- `[IV]{1,3}` : one, two or three characters I or V, a naive Roman numeral matcher.

This is the analyzed text with the highlighted king names:

"The Principality of Wales was incorporated into the Kingdom of England under the 
Statute of Rhuddlan in 1284, and in 1301 **King Edward I** invested his eldest son, 
the future **King Edward II**, as Prince of Wales. Since that time, except for **King 
Edward III**, the eldest sons of all English monarchs have borne this title.

After the death of **Queen Elizabeth I** without issue, in 1603, **King James VI** 
of Scotland also became **James I** of England, joining the crowns of England 
and Scotland in personal union."

Now the search:

In [None]:
import re

txt = """
The Principality of Wales was incorporated into the Kingdom of England under the 
Statute of Rhuddlan in 1284, and in 1301 King Edward I invested his eldest son, 
the future King Edward II, as Prince of Wales. Since that time, except for King 
Edward III, the eldest sons of all English monarchs have borne this title.

After the death of Queen Elizabeth I without issue, in 1603, King James VI 
of Scotland also became James I of England, joining the crowns of England 
and Scotland in personal union. 
"""

rge1 = re.compile(r"\w+\s+\w+\s+[IV]{1,3}")

print(rge1.findall(txt))

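Notice that "King Edward III" is split by a newline. The pattern `\s+` handles this correctly because it matches any run of spaces and newlines. Notice also that the last match is "became James I": the pattern always requires two words before the number.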

Now the same pattern with the grouped parts responsible for a personal name and a number: 

In [None]:
rge1 = re.compile(r"\w+\s+(\w+)\s+([IV]{1,3})")
print(rge1.findall(txt))
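
Since each match is now a tuple, it can be unpacked directly in a loop. The sketch below uses a short made-up sentence instead of the full text; the names `sample` and `kings` are illustrative.

```python
import re

# Short illustrative sentence, not the full text analyzed above
sample = ("King Edward I invested his son, the future King Edward II. "
          "Later King James VI of Scotland also became James I of England.")

rge = re.compile(r"\w+\s+(\w+)\s+([IV]{1,3})")

# Each match is a (name, number) tuple; join the two parts back together
kings = [f"{name} {number}" for name, number in rge.findall(sample)]

print(kings)
```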

Now we discuss the function `re.sub(pattern, repl, string)`.

It finds every occurrence of the pattern in the string and replaces it with `repl`.

`repl` can be a string or a function. 

If it is a string, backreferences such as `\6` are replaced with the substring matched by group 6 in the pattern. 

First, a trivial example without regular expressions:

In [None]:
import re

txt = "Keep your friends close and your enemies closer"

sbs = re.sub(r"close", "distant", txt)
print(sbs)
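
Note that `close` is also found inside `closer`, so the trivial example changes both words. If only the whole word should be replaced, one option is the word-boundary symbol `\b`. The sketch below assumes that this is the desired behavior; the names `naive` and `whole_word` are illustrative.

```python
import re

txt = "Keep your friends close and your enemies closer"

# The plain pattern also matches "close" inside "closer"
naive = re.sub(r"close", "distant", txt)
print(naive)

# \b matches only at a word boundary, so "closer" is left untouched
whole_word = re.sub(r"\bclose\b", "distant", txt)
print(whole_word)
```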

Consider now using patterns.

The pattern `(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)[;,]?\s*` below matches arithmetical expressions and extracts the numbers from them.

It contains three parts enclosed in round brackets. The pieces of text matched by these parts are called groups.

- `(\d+)` : an integer number; it will be group 1 since it comes first
- `\s*\+\s*` : a plus sign protected by a backslash and surrounded by optional spaces
- `(\d+)` : an integer number; it will be group 2
- `\s*=\s*` : an equal sign surrounded by optional spaces
- `(\d+)` : an integer number; it will be group 3
- `[;,]?\s*` : separators: an optional comma or semicolon followed by optional spaces

In [None]:
import re

txt = '1 + 2 = 3, 3+ 4 = 7; 7+8=15'
pat = r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)[;,]?\s*"

rge = re.compile(pat)
rge.findall(txt)

Now we will use the function `sub` to replace the `+` and `=` signs with their verbalizations.

We also want to drop all separators and replace them with newline symbols.

Observe a key point here: a group is inserted into the replacement string as `\N`, where `N` is the number of the group (`\1`, `\2`, `\3`). The trailing `\n` in the replacement string is an ordinary newline symbol.

In [None]:
s = re.sub(pat, r"\1 plus \2 equals \3\n", txt)
print(s)

This substitution could be done in a simpler way. 

Instead of using groups we could replace `+` and `=` with `plus` and `equals`, respectively, and replace the separators with newlines. 

But the following cannot be done without grouping parts of the pattern, because the numbers change places:

In [None]:
s = re.sub(pat, r"\3 minus \2 equals \1\n", txt)
print(s)
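
Reordering groups is also handy for the dates considered earlier. A minimal sketch, assuming all dates are written as `dd/mm/yyyy`; the names `dates`, `pat_date` and `iso` are introduced for illustration:

```python
import re

dates = "15/02/2021 and 01/12/1999"

# Groups: 1 is the day, 2 is the month, 3 is the year
pat_date = r"(\d{2})/(\d{2})/(\d{4})"

# Put the groups back in the order year-month-day
iso = re.sub(pat_date, r"\3-\2-\1", dates)

print(iso)
```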

If the `repl` parameter of the function `sub` is itself a function, it is called for every occurrence of the pattern. The function takes a single match object as its argument and returns the replacement string.

The example below takes a string with an arithmetical expression in it and substitutes the expression with its result.

The pattern `(\d+)\s*\+\s*(\d+)` contains the following parts:

- `(\d+)` : an integer number; it will be the group 1
- `\s*\+\s*` : a plus sign protected by a backslash and surrounded by optional spaces
- `(\d+)` : an integer number; it will be the group 2

When `sub` finds a match, it calls the function `add_replacer` and passes a `Match` object to it. 

This object has a method `.group()` that provides access to the matched groups. 

We convert them into integers, add them and return the result converted to a string.

This string is substituted for the matched text.

In [None]:
# Example from https://medium.com/python-in-plain-english/the-incredible-power-of-pythons-replace-regex-6cc217643f37
import re
  
def add_replacer(match_obj):
    return str(int(match_obj.group(1)) + int(match_obj.group(2)))

def eval_adds(string):
    return re.sub(r"(\d+)\s*\+\s*(\d+)", add_replacer, string)

print(eval_adds("the result is 1 + 2"))
print(eval_adds("the result is 6 + 4"))
print(eval_adds("the result is 15 + 5"))
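
The function `sub` makes a single pass over the string, so in a chained sum such as `1 + 2 + 3` only the first pair is evaluated. One possible extension, sketched below, repeats the substitution until nothing changes. The names `ADD` and `eval_adds_fully` are hypothetical; `subn` is the variant of `sub` that also returns the number of replacements made.

```python
import re

ADD = re.compile(r"(\d+)\s*\+\s*(\d+)")

def add_replacer(match_obj):
    # Replace "a + b" with the value of the sum
    return str(int(match_obj.group(1)) + int(match_obj.group(2)))

def eval_adds_fully(string):
    # Repeat the substitution until no more sums are found
    while True:
        string, n_subs = ADD.subn(add_replacer, string)
        if n_subs == 0:
            return string

# A single pass leaves one sum unevaluated
print(ADD.sub(add_replacer, "1 + 2 + 3"))

# Repeated passes evaluate the whole chain
print(eval_adds_fully("the result is 1 + 2 + 3"))
```

The loop always stops because every replacement removes one `+` sign from the string.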