# Exercise 2 (solution)

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path
from seaborn import load_dataset

## Task 1: List and dict comprehensions

Assume that we have a dictionary that stores some quality measures for models of different sizes:

In [None]:
results = {
    "large": {"acc": 0.9, "f1": 0.85},
    "medium": {"acc": 0.83, "f1": 0.87},
    "small": {"acc": 0.7, "f1": 0.5},
}

Create a variable called `acc` that maps each model to its accuracy.

In [None]:
acc = {name: info["acc"] for name, info in results.items()}
acc

Filter the `results` dictionary so that only the information for models with an accuracy over 0.8 is kept.

In [None]:
best = {name: info for name, info in results.items() if info["acc"] > 0.8}
best
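
For comparison, the same filter can be written as an explicit loop. This is only a sketch that repeats the `results` dictionary from above so it runs on its own; the name `best_loop` is made up for illustration.

```python
results = {
    "large": {"acc": 0.9, "f1": 0.85},
    "medium": {"acc": 0.83, "f1": 0.87},
    "small": {"acc": 0.7, "f1": 0.5},
}

# Explicit loop that keeps the models with an accuracy over 0.8
best_loop = {}
for name, info in results.items():
    if info["acc"] > 0.8:
        best_loop[name] = info

sorted(best_loop)
```

The comprehension is shorter, but the loop can be easier to debug when the condition gets more complicated.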

## Task 2: Create numpy arrays

Create the following arrays:

1. A three-dimensional array of shape `(2, 3, 4)` containing zeros
2. A two-dimensional array with 2 rows and 5 columns that is equivalent to the list `[[0.1, 0.2, 0.3, 0.4, 0.5], [0.6, 0.7, 0.8, 0.9, 1.0]]`. Do not just type in the numbers.
3. A 3 x 3 identity matrix
4. A 3 x 4 empty array (using `np.empty`). Compare its entries with the ones your neighbor gets.

In [None]:
np.zeros((2, 3, 4))

In [None]:
np.linspace(0.1, 1, 10).reshape(2, 5)
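
There is more than one way to get this array. One alternative, shown only as a sketch, is to build the integers 1 to 10 with `np.arange`, reshape them, and divide by 10. The comparison uses `np.allclose` because floating point results can differ in the last digits. The names `via_linspace` and `via_arange` are made up for illustration.

```python
import numpy as np

via_linspace = np.linspace(0.1, 1, 10).reshape(2, 5)
# Integers 1..10, reshaped to 2 rows and 5 columns, then scaled to 0.1..1.0
via_arange = np.arange(1, 11).reshape(2, 5) / 10

np.allclose(via_linspace, via_arange)
```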

In [None]:
np.eye(3)

In [None]:
np.empty((3, 4))
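
`np.empty` only reserves memory and does not initialize it, so the entries are whatever happened to be stored there before. That is why your values will probably differ from your neighbor's. In the small sketch below, only the shape and the dtype are predictable; the name `arr` is made up for illustration.

```python
import numpy as np

arr = np.empty((3, 4))
# Shape and dtype are well defined, the values are arbitrary
print(arr.shape, arr.dtype)
```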

## Task 3: Numpy indexing

Throughout the entire task, work with the arrays `a` and `b` from the lecture slides.

In [None]:
a = np.arange(5)
b = np.arange(12).reshape(4, 3)

Select the middle element of `a`.

In [None]:
a[2]

Select all of `a` (but you need to put something into the square brackets).

In [None]:
a[:]

Select the last two rows of `b`.

In [None]:
b[2:]

Select the last two columns of `b`.

In [None]:
b[:, 1:]

Select the last two columns of the last two rows of `b`.

In [None]:
b[2:, 1:]
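
The slices above use positions counted from the start, which works because we know the shape of `b`. As a sketch of an alternative, negative indices count from the end and select the same block without relying on the shape:

```python
import numpy as np

b = np.arange(12).reshape(4, 3)

# Negative indices count from the end: last two rows, last two columns
np.array_equal(b[-2:, -2:], b[2:, 1:])
```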

## Task 4: Numpy calculations

In [None]:
x = np.array([[0.5, 1.5], [2.5, 3.5]])
y = np.diag([2, 3])
z = np.array([2, 3])

Do the following calculations with the arrays `x`, `y` and `z`:

1. Do a matrix multiplication of the two arrays `x` and `y`
2. Do an elementwise multiplication of the matrices `x` and `y`
3. Do an elementwise addition of `x` and `z`
4. Do an elementwise addition of `x` and `z.reshape(-1, 1)`
5. Describe the difference between the last two tasks.
6. Take the exponential of the array `z`
7. Sum the two rows in `x`

In [None]:
x @ y

In [None]:
x * y
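
Because `y` is a diagonal matrix, both products can be checked by hand: `x @ y` scales the first column of `x` by 2 and the second by 3, while `x * y` keeps only the diagonal of `x`, scaled the same way. A small sketch of that check; the names `matrix_product` and `elementwise_product` are made up for illustration.

```python
import numpy as np

x = np.array([[0.5, 1.5], [2.5, 3.5]])
y = np.diag([2, 3])

matrix_product = x @ y        # [[1.0, 4.5], [5.0, 10.5]]
elementwise_product = x * y   # [[1.0, 0.0], [0.0, 10.5]]

print(matrix_product)
print(elementwise_product)
```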

In [None]:
x + z

In [None]:
x + z.reshape(-1, 1)

Adding `z` to `x` without reshaping adds `z` to each row of `x`.
Adding `z` to `x` with reshaping adds `z` to each column of `x`.
The way this works is governed by the [broadcasting rules](https://numpy.org/doc/stable/user/basics.broadcasting.html). We won't go into detail for now.
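
To make the difference concrete, here is a small sketch that looks at the shapes involved. The names `added_to_rows` and `added_to_columns` are made up for illustration.

```python
import numpy as np

x = np.array([[0.5, 1.5], [2.5, 3.5]])
z = np.array([2, 3])

# z has shape (2,) and is treated like a row that is added to every row of x
added_to_rows = x + z                    # [[2.5, 4.5], [4.5, 6.5]]

# z.reshape(-1, 1) has shape (2, 1) and is treated like a column
# that is added to every column of x
added_to_columns = x + z.reshape(-1, 1)  # [[2.5, 3.5], [5.5, 6.5]]

print(z.shape, z.reshape(-1, 1).shape)
```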

In [None]:
np.exp(z)

In [None]:
x.sum(axis=0)
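
The `axis` argument is easy to mix up. With `axis=0` the sum runs over the rows, so the two rows are added to each other and you get one value per column. With `axis=1` you get one value per row instead. A short sketch of both; the names `rows_added` and `columns_added` are made up for illustration.

```python
import numpy as np

x = np.array([[0.5, 1.5], [2.5, 3.5]])

rows_added = x.sum(axis=0)     # [3.0, 5.0], adds the two rows
columns_added = x.sum(axis=1)  # [2.0, 6.0], adds the two columns

print(rows_added, columns_added)
```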

## Task 5: File paths

Define a path called `ROOT` that leads to the directory in which you store all materials for this course. Define the path relative to this notebook and then convert it to an absolute path. Note: The solution is different for everyone and depends on the directory structure you chose. 

In [None]:
ROOT = Path().parent.resolve()
ROOT
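
One detail worth knowing: the parent of the relative path `.` is again `.`, so calling `.parent` before `.resolve()` does not move up a directory. Whether `ROOT` should be the notebook's directory or a directory above it depends on your setup. The sketch below only illustrates the difference; the names `here` and `cwd` are made up for illustration.

```python
from pathlib import Path

here = Path()
# The parent of "." is ".", so .parent has no effect on the relative path
print(here.parent == here)

# Resolving first gives an absolute path, whose parent is the enclosing directory
cwd = here.resolve()
print(cwd.parent)
```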

Define a path to this notebook and use it to prove that this notebook exists.

In [None]:
NB_PATH = Path() / "exercise_2.ipynb"
NB_PATH.exists()

## Task 6: Read and save DataFrames

In [None]:
iris = load_dataset("iris")
iris.head()

Save this dataset in each of the file formats presented in the slides. Then re-load them into a DataFrame using the corresponding `read` function

In [None]:
iris.to_csv("iris.csv")
iris_csv = pd.read_csv("iris.csv")
iris.to_stata("iris.dta")
iris_dta = pd.read_stata("iris.dta")
iris.to_pickle("iris.pkl")
iris_pkl = pd.read_pickle("iris.pkl")
iris.to_parquet("iris.parquet")
iris_parquet = pd.read_parquet("iris.parquet")

Look at the dataset you reloaded from the csv file. What do you see? 

The index became an additional column (called `Unnamed: 0`). This can be fixed by passing `index=False` to `to_csv` or `index_col=0` to `read_csv`. I just wanted you to be aware of it.
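
A minimal sketch of the csv round trip and two possible fixes. It uses a made-up toy DataFrame and an in-memory buffer instead of the iris files, so it runs without touching the disk.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for the iris dataset
toy = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

# Default: the index is written and comes back as a column called "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# Option 1: do not write the index at all
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Option 2: write the index, but tell read_csv that the first column is the index
as_index = pd.read_csv(io.StringIO(toy.to_csv()), index_col=0)

print(list(with_index.columns), list(no_index.columns), list(as_index.columns))
```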

## Task 7: Create Variables

Add the square and the log of each numerical variable (i.e. all but "species") in the dataset. Use "NAME_squared" and "log_NAME" as naming conventions. Do not type in variable lists.

In [None]:
raw_vars = [c for c in iris.columns if c != "species"]
df = iris.copy()
for var in raw_vars:
    df[f"log_{var}"] = np.log(df[var])
    df[f"{var}_squared"] = df[var] ** 2
df
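
Excluding `"species"` by name works for iris. As a sketch of an alternative that does not mention any column name, the numeric columns can be selected by dtype with `select_dtypes`. The toy DataFrame below is made up and only mimics the structure of iris.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same structure as iris
toy = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

# Select numeric columns by dtype instead of excluding "species" by name
numeric_vars = toy.select_dtypes(include="number").columns
for var in numeric_vars:
    toy[f"log_{var}"] = np.log(toy[var])
    toy[f"{var}_squared"] = toy[var] ** 2

list(toy.columns)
```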

## Task 8: Select data

Select all rows where the species is setosa and the sepal_length is greater than or equal to 5.

In [None]:
iris.query("sepal_length >= 5 & species == 'setosa'").head()
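
The same selection can be written with a boolean mask. The sketch below compares both approaches on a made-up toy DataFrame that only reuses the column names of iris.

```python
import pandas as pd

# Hypothetical toy data with the same column names as iris
toy = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 6.3],
        "species": ["setosa", "setosa", "virginica"],
    }
)

via_query = toy.query("sepal_length >= 5 & species == 'setosa'")
# Outside of query, parentheses are required because & binds more tightly than >= and ==
via_mask = toy[(toy["sepal_length"] >= 5) & (toy["species"] == "setosa")]

via_query.equals(via_mask)
```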

## Task 9: Errors and Tracebacks

Write me a message on Zulip in which you describe this error.

In [None]:
df = pd.DataFrame(data=np.ones(2, 2), columns=["a", "b"])
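
A sketch of what such a description could contain: `np.ones` expects the shape as a single argument (an integer or a tuple). In `np.ones(2, 2)` the second `2` is therefore taken as the `dtype` argument, and NumPy raises a `TypeError`. Passing the shape as a tuple resolves it.

```python
import numpy as np
import pandas as pd

try:
    np.ones(2, 2)  # the second 2 is interpreted as the dtype
except TypeError as error:
    print(f"TypeError: {error}")

# Passing the shape as one tuple works
df = pd.DataFrame(data=np.ones((2, 2)), columns=["a", "b"])
df
```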