<a href="https://colab.research.google.com/github/munich-ml/MLPy2021/blob/main/10_Logfile_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logfile challenge

## setup

This section prepares the coding challenge.

**Step 1**: Clone GitHub project `MLPy2021`


In [None]:
import os

# Get munich-ml repo from GitHub
if "MLPy2021" in os.listdir():
    !git -C MLPy2021 pull
else:
    !git clone https://github.com/munich-ml/MLPy2021/

**Step 2**: Open file `logfile.csv` and store content into variable `s`

In [None]:
path = os.path.join("MLPy2021", "datasets", "logfile.csv")

with open(path, "r") as file:
    s = file.read()

## The task

After executing the setup, the variable `s` should be available, holding the content of the logfile.

The task is to **parse the logfile content**.

In [None]:
print(s)

# Python built-in functions

The Python interpreter has a number of functions built in. They are available without any import or other preparation.

Complete list of built-in functions: https://docs.python.org/3/library/functions.html

Some prominent examples:
- `print`
- `type`
- `min`
- `len`




In [None]:
type(s)

... the datatype of the variable `s` is `str`, string. 

In [None]:
len(s)

In [None]:
min(s)

`min` returns the smallest item of a collection. Maybe not very helpful for an `str` type object.
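
For a `str`, "smallest" means the character with the lowest Unicode code point. A minimal sketch, assuming a short made-up `sample` string instead of the real logfile content:

```python
# Hypothetical sample text, standing in for the logfile content
sample = "header\nx,sig0"

smallest = min(sample)   # character with the lowest code point
largest = max(sample)    # character with the highest code point

# repr() makes invisible characters such as the newline visible
print(repr(smallest), ord(smallest))
print(repr(largest), ord(largest))
```

In this sample the newline character wins, because control characters have lower code points than letters and digits.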

# `str` objects

## Slicing

Single characters of the string can be accessed using their index in brackets `[index]`:




In [None]:
s[0]

In [None]:
s[-1]

**Ranges** can be indexed by `[start_index : end_index]`.
The `start_index` is included, the `end_index` is not!

In [None]:
s[10:60]

In [None]:
first_line = s[:26]
first_line
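
The "start included, end excluded" rule can be checked on a short string. This is only a sketch; `text` is a made-up example, not part of the logfile:

```python
text = "header,value"   # hypothetical sample string

head = text[0:6]    # indices 0..5, index 6 is excluded
tail = text[6:]     # from index 6 to the end

print(head)
print(tail)
print(head + tail == text)   # both slices together rebuild the original
```

Because the end index is excluded, `text[:n] + text[n:]` always gives back the full string.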

## `str` object methods

The `str` object not only contains **data**, it also contains **functions**, called **methods**.

In [None]:
first_line.upper()

In [None]:
first_line

In [None]:
first_line = first_line.upper()

In [None]:
first_line
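
The cells above rely on the fact that `str` objects are immutable: a method like `upper()` returns a new string and leaves the original untouched. A small sketch with a made-up `line` variable:

```python
line = "measurement date"   # hypothetical sample string

upper_line = line.upper()   # returns a new str object

print(line)         # the original is unchanged
print(upper_line)

try:
    line[0] = "M"   # item assignment is not supported by str
except TypeError as err:
    print("TypeError:", err)
```

That is why the result has to be assigned back to the variable if the change should persist.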

## Exercise with `str` objects
- Create a variable `my_name` holding your name,
- print the name in lower case, 
- check its length and
- check its type 

### Solution

In [None]:
my_name = "Holger Steffens"
my_name

In [None]:
print(my_name.lower())
print(len(my_name))
print(type(my_name))

## `split` method

In [None]:
first_line.split(" ")

Split the logfile content `s` on the *new line character* `\n`

In [None]:
lines = s.split("\n")
lines
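
One detail worth knowing: if the text ends with a newline, `split("\n")` yields an empty string as the last element, while `splitlines()` does not. The sketch below uses a made-up `sample`; whether the real `logfile.csv` ends with a newline is not assumed here.

```python
sample = "header\nname,value\n"   # hypothetical text ending with a newline

by_newline = sample.split("\n")
by_splitlines = sample.splitlines()

print(by_newline)      # ['header', 'name,value', '']
print(by_splitlines)   # ['header', 'name,value']
```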

# `list` objects

In [None]:
type(lines)

In [None]:
len(lines)

In [None]:
lines[0]

... this seems better than counting the first line's characters and slicing the string manually!

`list` objects also support slicing:

In [None]:
lines[2:8]

In [None]:
lines[8:]

## Exercise with `list` objects
Reuse the `my_name` variable and print the number of characters in your first and second name.

### Solution

In [None]:
my_names = my_name.split()   # using the default separator of .split()
print(my_names[0], len(my_names[0]))
print(my_names[1], len(my_names[1]))

# `for` loops

The next step in the *logfile challenge* requires looping. 

- **`for`, `in`** are keywords in Python, as well as
- **`while`**,
- and of course there are more keywords: [all Python keywords](https://docs.python.org/3/reference/lexical_analysis.html#keywords)

A *C-style* `for` loop such as:

>`lines[0]`
>
>`lines[1]`
>
>`...`

requires an *index variable*. 


In [None]:
for i in [0, 1, 2]:
    print(i)

The built-in function `range()` simplifies the loop:

In [None]:
for i in range(3):
    print(i)
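
`range()` also accepts a start value and a step. Like slicing, the stop value is excluded. A short sketch:

```python
first_three = list(range(3))        # stop only: 0, 1, 2
from_two = list(range(2, 8))        # start and stop: 2 .. 7
every_fifth = list(range(0, 10, 5)) # start, stop and step: 0, 5

print(first_three)
print(from_two)
print(every_fifth)
```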

## looping over the `lines` object

In [None]:
lines

In [None]:
for i in range(len(lines)):
    line = lines[i]
    print(line)

## Adding an `if` statement

In [None]:
for i in range(len(lines)):
    line = lines[i]
    
    if line == "header":
        print("'" + line + "' found!")

## `list` objects are iterable

Iterable objects can be looped without using an index!

In [None]:
iter(lines)

In [None]:
for line in lines:     # this prototype is very common!
    
    if line == "header":
        print("'" + line + "' found!")
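
Under the hood, a `for` loop asks the iterable for an iterator and then calls `next()` until the iterator is exhausted. The sketch below does this by hand on a made-up `words` list:

```python
words = ["header", "measurements"]   # hypothetical sample list

it = iter(words)       # create an iterator from the iterable
first = next(it)       # fetch the first item
second = next(it)      # fetch the second item
print(first, second)

try:
    next(it)           # no items left
except StopIteration:
    print("iterator exhausted")
```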

## Exercise with `for` loops
Reuse the `my_name` variable once again to print the number of characters in your first and second name. Use a `for` loop this time.

### Solution

In [None]:
my_name

In [None]:
for name in my_name.split():
    print(name, len(name))

## Search for multiple keys simultaneously

In [None]:
keys = ['measurements', "header"]

Two `for` loops could be used to search a list of possible keys on every line of the logfile:

In [None]:
for line in lines:
    for key in keys:
        if line == key:
            print("'" + line + "' found!")

... that works!

However, there is a simpler solution: the `in` operator on a Python container (such as a list):

In [None]:
"0815" in keys

In [None]:
"header" in keys

In [None]:
for line in lines:
    if line in keys:
        print("'" + line + "' found!")   
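
Note that `in` behaves differently depending on the container: on a `list` it compares whole elements, on a `str` it searches for a substring. A minimal sketch:

```python
keys = ["measurements", "header"]

whole_element = "header" in keys     # an element equal to "header" exists
partial_in_list = "head" in keys     # no element equals "head"
partial_in_str = "head" in "header"  # substring search within a str

print(whole_element, partial_in_list, partial_in_str)
```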

## `enumerate` function

If an index is required, use the built-in function `enumerate`:

In [None]:
bag_of_words = ["Hello", "Team", "Python"]

for item in bag_of_words:
    print(item)

In [None]:
for item in enumerate(bag_of_words):
    print(item)

In [None]:
item

In [None]:
type(item)

# `tuple` objects
**Tuples** are basic Python containers and similar to **lists**

In [None]:
item

In [None]:
i, value = item
value

... this is called `unpacking`.
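
Unpacking requires the number of names on the left to match the number of items on the right. A small sketch with a made-up tuple:

```python
item = (2, "Python")   # hypothetical (index, value) pair

i, value = item        # two names for two items
print(i, value)

try:
    a, b, c = item     # three names for two items
except ValueError as err:
    print("ValueError:", err)
```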

Next, let's use `enumerate` and unpacking to search the lines for keys:

In [None]:
for lineNo, line in enumerate(lines):
    if line in keys:
        print("'" + line + "' found in line no. " + str(lineNo))   

A more readable alternative is the `str.format()` method:

In [None]:
"'{}' found in line no. {}".format("world", 1)

In [None]:
for lineNo, line in enumerate(lines):
    if line in keys:
        print("'{}' found in line no. {}".format(line, lineNo))

## Exercise with `enumerate`
Find the position of all vowels `aeiou` in the `my_name` variable.

### Solution

In [None]:
VOWELS = "aeiou"
my_name

In [None]:
for position, character in enumerate(my_name):
    if character.lower() in VOWELS:
        print("vowel '{}' found at position {}".format(character, position))

# `dict` objects
`dict` is a Python **dictionary**, holding `key`-->`value` pairs.

In [None]:
idxs = dict()

In [None]:
type(idxs)

In [None]:
for lineNo, line in enumerate(lines):
    if line in keys:
        idxs[line] = lineNo 

In [None]:
idxs

In [None]:
idxs["measurements"]

Adding more items to the dictionary

In [None]:
idxs["names"] = idxs["measurements"] + 1
idxs

In [None]:
idxs["params_begin"] = idxs["header"] + 1
idxs["params_end"] = idxs["measurements"] - 1
idxs["data"] = idxs["names"] + 1

In [None]:
idxs
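
Looking up a key that does not exist raises a `KeyError`. The sketch below shows two common ways to guard against that, using a made-up `idxs` dictionary with hypothetical line numbers:

```python
idxs = {"header": 0, "measurements": 5}   # hypothetical line numbers

has_header = "header" in idxs        # membership test on the keys
missing = idxs.get("data")           # .get() returns None for a missing key
fallback = idxs.get("data", -1)      # ... or a default of your choice

print(has_header, missing, fallback)

try:
    idxs["data"]                     # square brackets raise KeyError instead
except KeyError as err:
    print("KeyError:", err)
```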

### `dict` methods



In [None]:
def print_object_attributes(obj):
    print("{} has the following (public) attributes:".format(obj.__class__))
    for attr in dir(obj):
        if not attr.startswith("__"):
            # first line of the docstring; fall back to "" if there is none
            doc = (getattr(obj, attr).__doc__ or "").split("\n")[0]
            print(".{:25}{}".format(attr, doc))

In [None]:
print_object_attributes(idxs)

In [None]:
idxs.keys()

In [None]:
idxs.values()

In [None]:
idxs.items()

In [None]:
item = next(iter(idxs.items()))
print(item, type(item))

In [None]:
for x in idxs.keys():
    print(type(x), x)

In [None]:
for x in idxs.values():
    print(type(x), x)

In [None]:
for x in idxs.items():
    print(type(x), x)

The common way to loop over `dict` `items` is by **unpacking**:

In [None]:
for key, value in idxs.items():
    print("key: {:13} value: {}".format(key, value))

## Exercise with `dict`
Count the characters within the `my_name` variable. Create a `characters` dictionary with
- each unique character as `key`, and
- the count as `value`.

Example:
`"Greg"` --> `{'g': 2, 'r': 1, 'e': 1}`

### Solution

In [None]:
characters = dict()
for c in my_name.lower():
    if c in characters:   # membership test on a dict checks its keys
        characters[c] += 1
    else:
        characters[c] = 1

characters
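
As an aside, the standard library offers `collections.Counter`, which does the same counting in one call. This is an alternative to the loop above, not a replacement for the exercise; the name `"Greg"` is taken from the example:

```python
from collections import Counter

my_name = "Greg"   # name from the exercise example

characters = Counter(my_name.lower())   # counts each character
print(dict(characters))                 # {'g': 2, 'r': 1, 'e': 1}
```

`Counter` is a `dict` subclass, so the `keys()`, `values()` and `items()` methods shown above work on it as well.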

# Table of Python containers
|Python container|list|tuple|set|dictionary|
|---|---|---|---|---|
|creation|`li = [1, 1, "Hi!"]`|`tup = (1, 1, "Hi!")`|`se = set([1, 1, "Hi!"])`|`d = {"Musk":"Elon", "Bezos":"Jeff"}`|
|print return|`[1, 1, 'Hi!']`|`(1, 1, 'Hi!')`|`{1, 'Hi!'}`|`{'Musk': 'Elon', 'Bezos': 'Jeff'}`|
|mutability|mutable|**immutable**|mutable|mutable|
|slicing|yes, `li[0]` --> `1`|yes, `tup[0]` --> `1`|no slicing|no slicing|
|primary usage|basic container with<br>`append()` method|use if data doesn't<br>change|- get unique values<br>- set operations: `union`, `difference`, ...|lookup table|
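
The mutability row of the table can be verified with a few lines. A minimal sketch, reusing the example values from the table:

```python
li = [1, 1, "Hi!"]
tup = (1, 1, "Hi!")
se = set(li)           # duplicates are dropped

li[0] = 99             # lists can be changed in place

try:
    tup[0] = 99        # tuples cannot
except TypeError as err:
    print("TypeError:", err)

print(li)
print(tup)
print(len(se))         # only the unique values remain
```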



# `set` objects

A Python `set` is a mutable container (it can grow like a `list`) that contains **unique values**.

Therefore, one typical use-case is getting the unique values of another container: 

In [None]:
set(my_name.lower())

`set` objects provide **set operations** as methods:

In [None]:
tup = (1, 1, "Hi!")   # same example tuple as in the table above
print_object_attributes(set(tup))
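
A few of these set operations in action. The sketch uses two made-up character sets; `sorted()` is only there to get a reproducible print order, since sets are unordered:

```python
a = set("holger")     # hypothetical example sets
b = set("steffens")

in_either = sorted(a.union(b))        # characters in a or b
in_both = sorted(a.intersection(b))   # characters in a and b
only_in_a = sorted(a.difference(b))   # characters in a but not in b

print(in_either)
print(in_both)
print(only_in_a)
```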

# Return to the 'Logfile challenge'

In [None]:
lines

In [None]:
idxs

Now we can **slice** the `lines` to more handy variables:

In [None]:
params_lines = lines[idxs["params_begin"] : idxs["params_end"]]
params_lines

In [None]:
data_lines = lines[idxs["data"] :]
data_lines

Finally, let's slice the `names` line:

In [None]:
lines[idxs["names"]]

We can create a `names` `list` directly by splitting on `,`:

In [None]:
names = lines[idxs["names"]].split(",")
names

In [None]:
params_lines

Storing the parameters is another perfect application for a dictionary

In [None]:
params = dict()
for line in params_lines:
    key, value = line.split(",")
    params[key] = value
params

perfect!

... see how readable the code gets in the following:

In [None]:
params["measurement date"]

There is yet one imperfection of the `params` dictionary: The `type` of all `values` is `str`. 

We will work on that later, after parsing the data.
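
Why the `str` type matters can be seen in a short sketch. The key and value below are made up for illustration and are not taken from the real logfile:

```python
# Hypothetical parameter, stored as str like all values in params
params = {"calibration factor sig0": "1.05"}

value = params["calibration factor sig0"]

repeated = value * 2          # str * int repeats the text
scaled = float(value) * 2     # float * int does the arithmetic

print(type(value), repeated)
print(type(scaled), scaled)
```

Multiplying the raw `str` silently produces `'1.051.05'`, so a conversion with `float()` is needed before any calculation.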

In [None]:
data_lines

In [None]:
names

In [None]:
data_lines[0].split(",")

## Functions

A function is needed that:
- takes a `str` input
- converts the `str` to a `float`
- removes `Ohms` text
- removes `mOhms` text and divides the value by 1000

Before programming the actual functionality, it is good practice to write the test bench first:


In [None]:
test_vector = data_lines[0].split(",")    # reuse the first data line as test vector
test_vector[-1] = "100 mOhms"             # replace the last item, so there is a 'mOhms' item
test_vector

In [None]:
def string_to_float(s):
    return s + " processed by 'string_to_float'"

for item in test_vector:
    result = string_to_float(item)
    print(type(result), result)

How to convert a `str` to a `float`?

In [None]:
float("3.33")

In [None]:
float("3.33 ")

In [None]:
#float("3.33 Ohms")

Remove the "Ohms" first

In [None]:
for item in test_vector:
    print(item, "Ohms" in item)

... that would work.

However, using the `str` method `find` has the advantage of providing the **index** where the text was found (or `-1` if it was not found).

In [None]:
for item in test_vector:
    print(item, item.find("Ohms"))

In [None]:
for item in test_vector:
    idx = item.find("Ohms")
    if idx > 0:
        prefix = item[idx-1]
        mults = {" ": 1, "m": 0.001}
        print(item, mults[prefix])
    else:
        print(item)

... almost done!

Just do the *type casting* on the item and apply `mults`:

In [None]:
def string_to_float(s):
    idx = s.find("Ohms")
    if idx > 0:
        number = s.split(" ")[0]
        prefix = s[idx-1]
        return float(number) * {" ": 1, "m": 0.001}[prefix]

    return float(s)

In [None]:
test_vector

In [None]:
for item in test_vector:
    result = string_to_float(item)
    print(type(result), result)

Great, the function `string_to_float` works!
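
The visual check above can be turned into an automatic one with `assert`. The sketch repeats the function so it runs on its own; the test cases and their expected values are assumptions chosen for illustration, not values from the logfile:

```python
import math

def string_to_float(s):
    idx = s.find("Ohms")
    if idx > 0:
        number = s.split(" ")[0]
        prefix = s[idx-1]
        return float(number) * {" ": 1, "m": 0.001}[prefix]
    return float(s)

# Hypothetical test cases: input string --> expected float
cases = {"3.33": 3.33, "3.33 ": 3.33, "47 Ohms": 47.0, "100 mOhms": 0.1}

for text, expected in cases.items():
    result = string_to_float(text)
    # isclose() tolerates tiny floating point rounding differences
    assert math.isclose(result, expected), (text, result)

print("all test cases passed")
```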

Let's use it for parsing the data:

In [None]:
data_lines

The data is 2-dimensional: 
- 4 values per sample / row
- N rows

One potential solution for this application is a list of lists:

In [None]:
data = list()

for data_line in data_lines:
    row = list()
    for item in data_line.split(","):
        row.append(string_to_float(item))
    data.append(row)

data
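
The same nested loop is often written as a nested list comprehension. A sketch on made-up `data_lines`, using plain `float` instead of `string_to_float` to stay self-contained:

```python
data_lines = ["0.0,1.5,2.5", "1.0,1.75,2.25"]   # hypothetical data lines

# outer comprehension: one row per line, inner: one float per item
data = [[float(item) for item in line.split(",")] for line in data_lines]

print(data)
```

Both forms build the same list of lists; the explicit loops are easier to step through, the comprehension is more compact.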

**Status** on the logfile challenge. We got:
- `data` in a list of lists (N x 4)
- `names` list with column names
- `params` parameter dictionary

In [None]:
names

In [None]:
params

# Complete `parse_logfile_string` function

Let's summarize the functionality we got so far, into a function `parse_logfile_string` that:
- takes the logfile string `s` as input and
- returns `params`, `names` and `data`

In [None]:
def parse_logfile_string(s):
    # split the input string on "\n" new line
    lines = s.split("\n")

    # create a look-up table of sections and line numbers
    idxs = dict()
    for lineNo, line in enumerate(lines):
        if line in ['measurements', "header"]:
            idxs[line] = lineNo 
    idxs["names"] = idxs["measurements"] + 1
    idxs["params_begin"] = idxs["header"] + 1
    idxs["params_end"] = idxs["measurements"] - 1
    idxs["data"] = idxs["names"] + 1

    # parse the column names
    names = lines[idxs["names"]].split(",")

    # parse the params_lines list(str) into params dict{param: value}
    params = dict()
    for line in lines[idxs["params_begin"] : idxs["params_end"]]:
        key, value = line.split(",")
        params[key] = value

    # converts str to float incl. "Ohms" removal
    def string_to_float(s):
        idx = s.find("Ohms")
        if idx > 0:
            number = s.split(" ")[0]
            prefix = s[idx-1]
            return float(number) * {" ": 1, "m": 0.001}[prefix]
        return float(s)

    # parse data_lines list(str) into data list(list(floats))
    data = list()
    for data_line in lines[idxs["data"] :]:
        row = list()
        for item in data_line.split(","):
            row.append(string_to_float(item))
        data.append(row)

    return {"params": params, "names": names, "data":data}

In [None]:
log = parse_logfile_string(s)
log

In [None]:
log.keys()

done!

We successfully created a **function** `parse_logfile_string(s)` that parses the input text string `s` and returns a `dict` with 
- the header parameters `params`, 
- the actual data `data`,
- the data column names `names` 

# Limitations of Python basic containers

The general-purpose Python container `list` worked well for reading the file and appending items. For further numerical processing, however, it is not ideal, as the following examples show.

## Apply calibration factors

The task:
- search the `params` for `calibration factor sig?` keys
- multiply all column values of `sig?` with the respective `calibration factor`


In [None]:
log = parse_logfile_string(s)
params = log["params"]
data = log["data"]
names = log["names"]

In [None]:
for param, cal_factor in params.items():
    if "calibration factor" in param:
        sig = param.split(" ")[-1]
        print("Signal={}, cal_factor={}".format(sig, cal_factor))

Next step is to index the data: Get all rows of a specific column...

In [None]:
names

In [None]:
names.index("sig0")

In [None]:
for param, cal_factor in params.items():
    if "calibration factor" in param:
        sig = param.split(" ")[-1]
        col_index = names.index(sig)
        print("Signal={}, col_index={}, cal_factor={}".format(sig, col_index, cal_factor))

Next, we need to index all rows of the col_index...

In [None]:
data[0]

In [None]:
data[0][col_index]

A nested `list[rows][cols]` can only be indexed row-first: there is no direct way to slice out a whole column!

As a solution, we can iterate over the rows and access the `col_index` within each row:

In [None]:
for data_row in data:
    val = data_row[col_index]
    print(val)
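
With plain lists, a column can also be collected with a list comprehension, or all columns at once by transposing with `zip(*data)`. A sketch on made-up numbers:

```python
# Hypothetical 3 x 3 data, rows of floats as produced by the parser
data = [[0.0, 1.0, 2.0],
        [1.0, 1.5, 2.5],
        [2.0, 2.0, 3.0]]
col_index = 1

column = [row[col_index] for row in data]   # one column as a list
columns = list(zip(*data))                  # all columns, as tuples

print(column)
print(columns[col_index])
```

Both variants create copies, so changing `column` does not change `data`. This is one reason the in-place loop below is still needed for the calibration.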

Put it all together:


In [None]:
data[:3]

In [None]:
for param, cal_factor in params.items():
    if "calibration factor" in param:
        sig = param.split(" ")[-1]
        col_index = names.index(sig)

        for data_row in data:
            data_row[col_index] *= float(cal_factor)

In [None]:
data[:3]

This worked, **BUT**

it would be much more convenient and readable to have something like 

`data[:, col_index] *= cal_factor`

What we want is **arbitrary indexing** and **element-wise operations**!





Maybe we even want **labeled indexing** like

`data[:, sig] *= cal_factor`

## Array slicing for visualization

Another common usage is **arbitrary array slicing** for data visualization (e.g. using `matplotlib`)


In [None]:
names

In [None]:
x_col, y_col = names.index("x"), names.index("sig0")   # look up the columns once
x, y = list(), list()
for data_row in data:
    x.append(data_row[x_col])
    y.append(data_row[y_col])

In [None]:
import matplotlib.pyplot as plt
plt.plot(x, y, "o");