# Day 1

The first day is mainly a recap to bring everyone up to speed, but we will also cover some lesser-known details of these fundamental concepts and libraries. We will have a look at (or revisit):

- [Functions](#functions)
- [Dictionaries](#dictionaries)
- [Exceptions](#exceptions)
- Some commonly used libraries ([NumPy](#numpy), [Matplotlib](#matplotlib), and [Pandas](#pandas))
- [SciPy](#scipy)

## Credits

Part of the material includes and builds upon the material from *Scientific Programming with Python* (WiSe 2023/2024), which was kindly provided by **Andriy Sokolov**.

## Functions

**Why functions?**

- **Modularity:** Functions allow you to break down a complex program into smaller, manageable pieces. 
- **Readability:** Functions make the code more readable by providing meaningful names to blocks of code. 
- **Abstraction:** Functions abstract away implementation details. 
- **Code Reusability:** Once you've written a function, you can reuse it in different parts of your program.
- **Parameterization:** Functions can accept parameters (inputs) and return values (outputs). This makes your code flexible and general-purpose. 
- **Debugging:** Functions can be tested and debugged independently of the rest of the program. 
- **Namespace Isolation:** Variables declared inside a function are local to that function, i.e., they are isolated from the rest of the program.

### Functions with Different Argument Types

- Functions need to have the form `def <fun_name>([<arguments>]):`
- Return statements are **not required**
- Arguments can be passed as **positional arguments** (matched by order, e.g., `1, 2`) or as **keyword arguments** (matched by name, e.g., `x=100`); the latter is recommended for readability.
- A parameter can have a **default value** (e.g., `b=3`); it then only needs to be specified if we want to deviate from the default.
- `*args` packs additional positional arguments into a **tuple**
- `**kwargs` packs additional keyword arguments into a **dictionary** (we can also **unpack** a dictionary into function arguments, which we will see later)
- Note that we can also return multiple values (they are packed into a tuple).

In [None]:
def f(a, b, *args, **kwargs):
  print(F'a = {a}, b = {b}')
  print(F'args = {args}')
  print(F'kwargs = {kwargs}')
  return a + b - kwargs["x"], a - b  # return statement (not required)

# Function call
res1, res2 = f(1, 2, 'foo', 'bar', 'baz', 'qux', x=100, y=200)
print(res1)
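
The bullet points above mention positional calls, keyword calls, and default values. The following minimal sketch shows the three call styles side by side; the functions `power` and `describe` are hypothetical helpers introduced only for this illustration.

```python
def power(base, exponent=2):
    """Raise base to exponent (hypothetical helper; exponent defaults to 2)."""
    return base ** exponent

# Positional arguments only: matched by order
r_positional = power(3, 3)

# Keyword arguments: matched by name, so the order does not matter
r_keyword = power(exponent=3, base=3)

# Omitting `exponent` falls back to the default value 2
r_default = power(3)

print(r_positional, r_keyword, r_default)  # 27 27 9

def describe(*args, **kwargs):
    """Return the type names of the packed positional and keyword arguments."""
    return type(args).__name__, type(kwargs).__name__

print(describe(1, 2, x=3))  # ('tuple', 'dict')
```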

### Creating and Printing Docstrings

An important aspect of collaborative coding or publishing code is good documentation. Writing **docstrings** for your functions helps with that. The docstring is stored in the function's `__doc__` attribute.

In [None]:
def f(a, b=3):
  """ Add up two numbers.
  Parameters:
    a (int, float): first number
    b (int, float): second number (default=3)
  Output:
    Sum of a and b
  """
  return a + b
print(f.__doc__)
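
Besides reading `__doc__` directly, the standard library offers `inspect.getdoc`, which returns the docstring with its common indentation removed. The sketch below uses a hypothetical function `add`; in a notebook, `help(add)` would show similar information together with the signature.

```python
import inspect

def add(a, b=3):
    """Add up two numbers.

    Parameters:
      a (int, float): first number
      b (int, float): second number (default=3)
    """
    return a + b

# inspect.getdoc returns the docstring with common leading whitespace removed
clean_doc = inspect.getdoc(add)

first_line = clean_doc.splitlines()[0]
print(first_line)  # Add up two numbers.
print(clean_doc)
```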

### Lambda Functions

A lambda function is a small anonymous function. A lambda function can take any number of arguments, but can only have one expression: `lambda <arguments> : <expression>`.

In [None]:
g = lambda x : x + 1
val = g(10)
print(val)

def generic_fun(a, x, lambda_fun):
    return a * lambda_fun(x)

print(generic_fun(4, 10, g))
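
Lambda functions are often passed directly as an argument instead of being assigned to a name first. As a small sketch (the lists `words` and `pairs` are made-up example data), the built-in functions `sorted` and `max` accept a `key` function that decides what is compared:

```python
words = ["banana", "kiwi", "cherry", "fig"]

# Sort by word length; words of equal length keep their original order
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['fig', 'kiwi', 'banana', 'cherry']

pairs = [("Bob", 78), ("Alice", 85), ("Charlie", 95)]

# Pick the tuple whose second entry is the largest
best = max(pairs, key=lambda p: p[1])
print(best)  # ('Charlie', 95)
```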

#### Mini-Exercise

You are given a list of numbers.  

1. Use a lambda function to create a new list where each number is squared. Hint: use the [map()](https://www.w3schools.com/python/ref_func_map.asp) function, and do not forget to convert the result back into a list.
2. Then, use another lambda function together with [filter()](https://www.w3schools.com/python/ref_func_filter.asp) to keep only the numbers greater than 20 from the squared list.

Print both lists.

In [None]:
numbers = [2, 4, 6, 8, 10]

# TODO: Write your solution below.


---

## Dictionaries

Another data type in Python is the **dictionary**, which stores data in *key-value* pairs. A **key** has to be hashable, which in practice means an immutable type (e.g., str, int, float, or tuple), while the **value** can be of any type, including mutable ones such as a list or even another dict.
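
To make the restriction on keys concrete, here is a minimal sketch (the dictionary `grid` is a made-up example): a tuple works as a key, whereas a list is rejected with a `TypeError`. The exact wording of the error message may differ between Python versions.

```python
grid = {}
grid[(0, 1)] = "tuple keys work"   # a tuple is immutable and hashable
grid["config"] = {"depth": 2}      # a value may itself be a dict

try:
    grid[[0, 1]] = "list key"      # a list is mutable and therefore unhashable
except TypeError as err:
    message = str(err)
    print(f"TypeError: {message}")

print(grid[(0, 1)])
print(grid["config"]["depth"])  # 2
```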

### Some Basic Functionality

In [None]:
d = dict(name = "John", age = 36)
## Access
print(d["name"])    # access via key
print(d.get("age")) # access with get function
print(d.keys())     # return all keys
print(d.values())   # return all values

d["name"] = "Bob"   # change via assignment
print(d["name"])
d.update({"age": 52}) # change with update function
print(d.get("age"))

d[42] = "The Answer"
print(d)
d.pop("age")  # removes the element; alternatively: del d["age"]
print(d)
d2 = d.copy() # creates a (shallow) copy of d
d.clear()     # clears the dict
print(d)
print(d2)

### Pass Dicts as Function Arguments (Unpacking)

In [None]:
def greetPerson(name, last_name):
  print(f"Hi {name} {last_name}!")

students = {
  "students1" : {
    "name" : "Emil",
    "last_name" : "Mustermann"
  },
  "students2" : {
    "name" : "Lara",
    "last_name" : "Schmidt"
  },
  "students3" : {
    "name" : "Linus",
    "last_name" : "Musk"
  }
}

# Greet everyone
for key in students:
    greetPerson(**students[key])

___

### Exercise 1 (Student Grades Tracker)

1. Create a dictionary named `student_grades` where **keys** are student **names** (e.g., "Alice", "Bob", "Charlie"), and **values** are lists of grades (e.g., [85, 90, 78]).
2. Write a function `add_grade(student, grade)` that takes two arguments: the student's name and the grade to add. Then the function adds the grade to the student's list in the dictionary. If the student doesn't exist, create a new entry for them.
3. Write a function `calculate_average(student)` that takes one argument: the student's name. It returns their average grade or `None` if the student doesn't exist.
4. Write a function `highest_average()` that returns the name of the student with the highest average grade, or `None` if the dictionary is empty.

**Hint:** You can apply the functions `sum` and `len` to a list.

In [None]:
## TODO:  Write your solution below.

In [None]:
# Test cases
student_grades = {}

add_grade("Alice", 85)
add_grade("Alice", 90)
add_grade("Bob", 78)
add_grade("Bob", 82)
add_grade("Charlie", 95)

print(calculate_average("Alice"))  # Expected output: 87.5
print(calculate_average("Bob"))    # Expected output: 80.0
print(highest_average())           # Expected output: "Charlie"

---

## Exceptions

To handle exceptions, we need to know which kind of exception will be raised. Here are some examples:

- **SyntaxError:** Raised when there is a syntax error in the code, indicating a mistake in the way the code is written.
- **IndentationError:** Raised when there is an indentation error.
- **NameError:** Raised when a local or global name is not found.
- **TypeError:** Raised when an operation or function is applied to an object of an inappropriate type.
- **ValueError:** Raised when a function receives an argument of the correct type but with an inappropriate value.
- **FileNotFoundError:** Raised when a file or directory is requested, but it cannot be found.
- **IndexError:** Raised when a sequence subscript is out of range.
- **KeyError:** Raised when a dict is accessed with a non-existing key.
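
To see some of these exception types in action, the following sketch provokes each one on purpose and records the name of the exception that was raised. The helper `provoke` and the file name are hypothetical and only serve this illustration.

```python
def provoke(action):
    """Call `action` and return the name of the exception it raises."""
    try:
        action()
    except Exception as err:  # broad on purpose: we only inspect the type here
        return type(err).__name__
    return "no exception"

examples = {
    "undefined name": lambda: undefined_variable,
    "str + int": lambda: "1" + 1,
    "int of text": lambda: int("Test"),
    "missing file": lambda: open("this_file_does_not_exist_12345.txt"),
    "index out of range": lambda: [1, 2, 3][5],
    "missing key": lambda: {"a": 1}["b"],
}

results = {label: provoke(action) for label, action in examples.items()}
for label, name in results.items():
    print(f"{label}: {name}")
```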

In [2]:
import random

# Generate a random integer between 0 and 4 (inclusive)
x = random.randint(0, 4)
print(0 / x)

ZeroDivisionError: division by zero

___
#### Mini Exercise

1. Fix the code above to raise your own exception!
2. Fix the code above by making sure x is not zero with an assertion.

Remark: Feel free to set the maximum of `randint` to 2 instead of 4 for evaluating your code.

In [1]:
# TODO: Write your solution for Fix 1 below.

In [2]:
# TODO: Write your solution for Fix 2 below.

### Exception Handling: Try Except Block

In [1]:
import random

try:
    x = random.randint(0, 2)
    print(0 / x)
except ZeroDivisionError:
    print("0 / 0 is not well-defined!")

0 / 0 is not well-defined!


___

### Some Best Practices

- Favor **specific exceptions** over generic ones: Raising the more specific `ZeroDivisionError` is better than just `Exception`. 
- Provide **informative error messages** and avoid exceptions with no message (helps with debugging).
- **Favor built-in exceptions** over custom exceptions.
- **Avoid raising the AssertionError** exception: This exception is specifically for the `assert` statement, and it's not appropriate in other contexts.
- **Raise** exceptions as **soon as possible**.
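
As a sketch of how these practices could look in code, the hypothetical helper `mean_grade` below validates its input first, raises specific built-in exceptions, and attaches an informative message to each of them.

```python
def mean_grade(grades):
    """Return the mean of a non-empty list of grades (hypothetical helper)."""
    # Raise as soon as possible: validate the input before doing any work
    if not isinstance(grades, list):
        raise TypeError(f"grades must be a list, got {type(grades).__name__}")
    if len(grades) == 0:
        raise ValueError("grades must contain at least one value")
    return sum(grades) / len(grades)

print(mean_grade([85, 90]))  # 87.5

messages = []
for bad_input in ([], "85, 90"):
    try:
        mean_grade(bad_input)
    except (TypeError, ValueError) as err:
        messages.append(f"{type(err).__name__}: {err}")

print(messages)
```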

___
### Exercise 2 (Division Calculator)

Complete the following code to cleanly perform a division from user inputs. Handle the following exceptions (note that you can put multiple `except` blocks after a `try` block):

1. ValueError: If the input is not a valid number.
2. ZeroDivisionError: If the denominator is zero.

Further, include an *else* block to print a success message if no exceptions occur and add a *finally* block to print a message indicating the program has terminated.

In [5]:
# Hint 
# To read an input, you can use the following code:
x = input("Enter the numerator: ")
print(x)

4


In [4]:
# TODO: Write your solution below.

___

### Exercise 3 (Number Guessing Game)

Complete the function below such that the program:

1. Randomly selects a number between 1 and 100 (provided)
2. Iteratively asks the user to guess the number (see Hint 1)  
3. Provides feedback if the guess is too high or too low (via `print`).
4. Continues until the user correctly guesses the number, or types "quit" to exit.
5. Congratulates the user upon a correct guess and prints the number of attempts.

Note: Make sure that your program does not crash if the user does not input an integer number or the word "quit". 

In [None]:
# Hint 1
guess = input("Take a guess: ")
print(guess)
print(type(guess))

In [None]:
# Hint 2
int("Test")

#### Solution 

To solve the exercise, complete the code below.

In [None]:
# TODO: Write your solution below.

import random

# Step 1 (provided)
# Generate a random number between 1 and 100
number_to_guess = random.randint(1, 100)
attempts = 0

## Libraries

### NumPy

**NumPy** is one of the most widely used libraries in data science and machine learning applications. This is due to several reasons:

- The core of **NumPy** is implemented in optimized C code, which makes the computation efficient.
- **NumPy** has an efficient backbone for working with N-dimensional arrays (via vectorization and indexing).
- Mathematical functions, random number generators, linear algebra routines, Fourier transforms, etc. are implemented.
- **NumPy** is compatible with libraries for visualization, sparse matrices and distributed computing.

In [None]:
import numpy as np

# Create a 2-D array and set every second element in
# all rows (except row 0) to -99:
x = np.arange(15, dtype=np.int64).reshape(3, 5)
x[1:, ::2] = -99
print(x)

# Compute max for each row (axis=1 reduces along the columns)
print(x.max(axis=1))

# Compute mean for each column (axis=0 reduces along the rows)
print(x.mean(axis=0))
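
The terms vectorization and indexing from the list above can be illustrated with a short sketch (the array `values` is made-up example data): the same computation is written once as a Python loop and once as a single array expression, followed by a selection with a boolean mask.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: one Python-level operation per element
squared_loop = np.array([v ** 2 for v in values])

# Vectorized version: the loop runs inside NumPy's compiled code
squared_vec = values ** 2

print(np.array_equal(squared_loop, squared_vec))  # True

# Boolean indexing: keep only the entries for which the mask is True
mask = values > 2
selected = values[mask]
print(selected)  # [3. 4.]
```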

### Matplotlib

**Matplotlib** is a great library for visualizing data, results of your estimator, etc., with [plenty of examples](https://matplotlib.org/stable/plot_types/index.html) and [tutorials](https://matplotlib.org/stable/tutorials/index.html).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate normally distributed random numbers
rng = np.random.default_rng()
samples = rng.normal(size=2500)

# Visualize the data with matplotlib via a histogram
plt.figure().set_figheight(2.5)
plt.hist(samples, bins=30, density=True)
plt.xlabel("Domain of X (partitioned to bins)")
plt.ylabel("Estimated density")
plt.title("Histogram density plot")
plt.show()

### Pandas

**Pandas** provides data structures for efficiently storing and manipulating large datasets, along with tools for reading and writing data in various formats. The name is derived from the term **Panel Data.**

#### Pandas (Filtering and 'groupby')

In [None]:
import pandas as pd

# Example dataset
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'Subject': ['Math', 'Math', 'Science', 'Science', 'Math'],
    'Score': [75, 99, 45, 88, 95]
})

# Filter rows where Score > 80
filtered_data = data[data['Score'] > 80]

# Group by "Name" and calculate the average score per student
average_scores = filtered_data.groupby('Name')['Score'].mean()
print(average_scores)

#### Pandas (Handling Missing Data)

In [None]:
import pandas as pd

# Example dataset with missing values (NaN)
data = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Charlie'],
    'Age': [None, 34, None, 30],
    'Score': [92, None, None, 78]
})

# Drop rows where 'Name' is missing
cleaned_data = data[data['Name'].notna()]

# Fill missing values with mean of column
filled_data = cleaned_data.fillna({
  'Age': data['Age'].mean(), 
  'Score': data['Score'].mean()
})
print(filled_data)
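
Before dropping or filling anything, it can help to check how much data is actually missing. The sketch below reuses the example dataset from above and counts the missing entries per column with `isna().sum()`; it also shows `dropna(subset=...)` as an alternative way to drop rows without a name.

```python
import pandas as pd

data = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, 'Charlie'],
    'Age': [None, 34, None, 30],
    'Score': [92, None, None, 78]
})

# Count missing entries per column
missing_per_column = data.isna().sum()
print(missing_per_column)

# Alternative to boolean filtering: drop rows where 'Name' is missing
cleaned = data.dropna(subset=['Name'])
print(len(cleaned))  # 3

# Column means ignore missing values by default
age_mean = data['Age'].mean()
print(age_mean)  # 32.0
```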

___

### Exercise 4 (Sales)

You are analyzing the sales of a fictional store over one week. Follow these steps:

1. Generate Sales Data: Use **NumPy** to generate random sales numbers for 7 days (e.g., with `np.random.randint()`). Create a list of weekdays (["Monday", "Tuesday", ..., "Sunday"]). 
2. Create a Pandas **DataFrame**: Combine the weekdays and the sales numbers into a Pandas `DataFrame` with columns `Day` and `Sales`.
3. Visualize the Data: Create a bar chart (`plt.bar()`) using Matplotlib to show daily sales.
4. Store your data table in CSV format with `to_csv()`, using ';' as the delimiter and including the header row.

Bonus: Make your results reproducible.

In [6]:
## TODO: Write your solution below.

## SciPy

**SciPy** is a free and open-source Python library for scientific and technical computing.

### Integrals

Single integrals are handled with the `quad(func, a, b)` function. **func** is the function that will be integrated (can be a lambda function) and **a** and **b** are the limits, which can be equal to `np.inf`.

**Remarks:**

- `quad` uses a technique from the Fortran library QUADPACK.
- `quad` returns a tuple consisting of "y : float" **the integral** of `func` from `a` to `b`, and "abserr : float", an **estimate of the absolute error** in the result.

Example for a single integral with `scipy.integrate.quad`: $\int_0^1 12 x \, dx$

In [None]:
from scipy.integrate import quad
f = lambda x: 12*x
integral = quad(f, 0, 1)
print(integral)
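
Since `quad` returns a tuple, the result and the error estimate can be unpacked directly. The following sketch also uses an infinite upper limit; the integrand $e^{-x}$ is an example chosen here because its integral over $[0, \infty)$ is known to be exactly 1.

```python
import numpy as np
from scipy.integrate import quad

# Unpack the integral value and the estimated absolute error
value, abserr = quad(lambda x: np.exp(-x), 0, np.inf)

print(f"integral = {value:.6f}")
print(f"estimated absolute error = {abserr:.1e}")
```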

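Double integral with `dblquad`: $\int_{0}^{\pi/4} \int_{\sin(x)}^{\cos(x)} 1 \, dy \, dx = \sqrt{2} - 1 \approx 0.4142$. Note that the integrand takes its arguments in the order `(y, x)` and that the inner limits are passed as functions of `x`.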

In [None]:
import numpy as np
from scipy import integrate
f = lambda y, x: 1
integrate.dblquad(f, 0, np.pi/4, np.sin, np.cos)

We can even solve triple integrals with `tplquad` or more with `nquad`.
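
As a minimal sketch of both functions, the example below integrates $x \cdot y \cdot z$ over the unit cube, where the exact value is $1/8$. The integrand is chosen here for illustration only. Note the different conventions: `nquad` takes the integrand as `f(x, y, z)` plus a list of ranges, while `tplquad` expects `f(z, y, x)` and the inner limits as functions.

```python
from scipy.integrate import nquad, tplquad

def integrand(x, y, z):
    return x * y * z

# nquad: one [lower, upper] range per integration variable
value_nquad, err_nquad = nquad(integrand, [[0, 1], [0, 1], [0, 1]])

# tplquad: integrand in (z, y, x) order; y-limits depend on x, z-limits on (x, y)
value_tpl, err_tpl = tplquad(
    lambda z, y, x: x * y * z,
    0, 1,                               # x from 0 to 1
    lambda x: 0, lambda x: 1,           # y from 0 to 1
    lambda x, y: 0, lambda x, y: 1,     # z from 0 to 1
)

print(value_nquad, value_tpl)
```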

___

### Exercise 1 (Integrals with SciPy)

Take a look at the SciPy [documentation for solving integrals](https://docs.scipy.org/doc/scipy/tutorial/integrate.html#general-integration-quad). Feel free to explore a bit by solving your favorite integral. After that, please solve the following two integrals with SciPy:

1. $\int_{0}^{1} e^{-x^2} \, dx \approx 0.7468$
2. $\int_{0}^{\pi/2} \int_{0}^{\sin(x)} x \cdot y \, dy \, dx \approx 0.4334$


In [None]:
# Write your solution below

### Linear Interpolation

- **Interpolation** is a method of constructing new data points within a range of known data points, which is useful for
  - Filling missing data
  - Smoothing noisy data
  - Resampling datasets

#### Why SciPy for Interpolation?

- The `scipy.interpolate` module provides tools for:
  - **1D interpolation** (e.g., linear, cubic splines)
  - **Multidimensional interpolation** (e.g., grid-based methods)
- SciPy is efficient since it builds upon the **FITPACK Fortran subroutines**, and it is easy to use.

The simplest form of interpolation is **linear interpolation**, which connects each pair of neighboring data points with a straight line.

In [None]:
import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt

# Data points
x = [0, 1, 3, 4, 6, 7, 8, 9]
y = np.sin(x)

# Create linear interpolator
f_linear = interp1d(x, y, kind="linear")

# New x values for interpolation
x_new = np.linspace(0, 9, num=50)
y_new = f_linear(x_new)

# Plot interpolated and original points
plt.figure().set_figheight(2.5)
plt.plot(x, y, 'o', label='Data Points')
plt.plot(x_new, y_new, '-', label='Linear Interpolation')
plt.plot(x_new, np.sin(x_new), '-', label='Original')
plt.legend()
plt.show()
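
To see what happens between two neighboring data points, the straight-line formula $y = y_0 + (x - x_0) \frac{y_1 - y_0}{x_1 - x_0}$ can be evaluated by hand. The sketch below does this for the data points at $x_0 = 1$ and $x_1 = 3$ from above and compares the result with NumPy's `np.interp`, which is used here as an independent reference.

```python
import numpy as np

# Two neighboring data points of the sine curve
x0, y0 = 1.0, np.sin(1.0)
x1, y1 = 3.0, np.sin(3.0)
x_query = 2.0

# Straight line through (x0, y0) and (x1, y1), evaluated at x_query
y_manual = y0 + (x_query - x0) * (y1 - y0) / (x1 - x0)

# Reference: NumPy's built-in 1-D linear interpolation
y_numpy = np.interp(x_query, [x0, x1], [y0, y1])

print(np.isclose(y_manual, y_numpy))  # True
print(f"interpolated: {y_manual:.4f}, true sin(2): {np.sin(x_query):.4f}")
```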

### Interpolation with Cubic Splines

Try it yourself: interpolate the data with a different kind of interpolation by setting the argument `kind` (see [documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html#scipy.interpolate.interp1d)) and plot the results in comparison to the linear interpolation. Use, e.g., `cubic` or `nearest`.

In [None]:
# Write your solution below

### Interpolation in 2D

**Mini-Exercise:** Run the code below and modify the positions in the code that are marked with `TODO`. Can you find some general trends?

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import griddata

# Defining a function on 2D
def func(x, y):
    return x*(1-x)*np.cos(4*np.pi*x) * np.sin(4*np.pi*y**2)**2

# Create a 2D grid on which we will evaluate the interpolation
# The grid is defined on [0,1]x[0,1], and XXXj defines how many points
# in this range are plotted.
grid_x, grid_y = np.mgrid[0:1:100j, 0:1:100j] # TODO: vary the number before each "j"

# We draw 1000 random points in [0,1]x[0,1] and evaluate our function there
rng = np.random.default_rng()
points = rng.random((1000, 2)) # TODO: Change the number of data points we can use for interpolation
values = func(points[:,0], points[:,1])

# Interpolate based on three different methods
grid_z0 = griddata(points, values, (grid_x, grid_y), method='nearest')
grid_z1 = griddata(points, values, (grid_x, grid_y), method='linear')
grid_z2 = griddata(points, values, (grid_x, grid_y), method='cubic')
# TODO: try out some other interpolation methods

# Plot the results
plt.subplot(221)
plt.imshow(func(grid_x, grid_y).T, extent=(0,1,0,1), origin='lower')
plt.plot(points[:,0], points[:,1], 'k.', ms=1)
plt.title('Original')
plt.subplot(222)
plt.imshow(grid_z0.T, extent=(0,1,0,1), origin='lower')
plt.title('Nearest')
plt.subplot(223)
plt.imshow(grid_z1.T, extent=(0,1,0,1), origin='lower')
plt.title('Linear')
plt.subplot(224)
plt.imshow(grid_z2.T, extent=(0,1,0,1), origin='lower')
plt.title('Cubic')
plt.gcf().set_size_inches(6, 6)
plt.show()

### SciPy linalg

The `scipy.linalg` module provides linear algebra operations, building on the functionality of the **NumPy** `linalg` module. `scipy.linalg` is a powerful tool for performing advanced linear algebra operations in scientific computing and numerical analysis.

Here is an (incomplete) list of amazing things you can do with `scipy.linalg`:

- Compute the **inverse** or the **determinant** of a matrix
- Perform an **eigenvalue**, **QR**, **singular value**, **Cholesky**, ... **decomposition** of a matrix
- Solve a linear system

In [None]:
import numpy as np
from scipy.linalg import inv, det, eig, svd, qr

A = np.array([[1,2],[3,4]])
A_inv = inv(A)
print(A_inv)        # inverse
print(det(A))       # determinant

# Eigenvalue decomposition
eigenvalues, eigenvectors = eig(A)
print(f"EVal: {eigenvalues}\nEVec: {eigenvectors}")

# Singular value decomposition
U, S, VT = svd(A)
print(f"U: {U}\nS: {S}\nVT: {VT}")

# QR decomposition
Q, R = qr(A)
print(f"Q: {Q}\nR: {R}")
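
The list above also mentions solving a linear system and the Cholesky decomposition, which the cell does not cover. A minimal sketch with `scipy.linalg.solve` and `scipy.linalg.cholesky` could look as follows; the right-hand side `b` and the symmetric positive definite matrix `S` are made-up example values.

```python
import numpy as np
from scipy.linalg import solve, cholesky

# Solve the linear system A x = b
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])
x = solve(A, b)
print(x)
print(np.allclose(A @ x, b))  # True

# Cholesky decomposition S = L L^T (requires a symmetric positive definite matrix)
S = np.array([[4.0, 2.0], [2.0, 3.0]])
L = cholesky(S, lower=True)
print(np.allclose(L @ L.T, S))  # True
```

Solving the system directly with `solve` is usually preferable to computing `inv(A) @ b`, since it avoids forming the inverse explicitly.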

___

### Exercise 2 (Linear Regression)

1. Implement a function `linearRegression` that takes as input two variables $X$ (possibly multivariate) and $y$ (univariate) and manually computes the linear regression estimate via the analytical solution $\hat{\beta} = (X^T X)^{-1} X^T y$. Return the estimate $\hat{\beta}$.
2. Apply your function to the provided simulated data, and evaluate if your estimate is correct.
3. Simulate the data for different sample sizes, and compare the estimates.
4. Evaluate your estimator on the second simulation and plot the fitted line given by your estimate $\hat{\beta}$ together with the training data.

**Hints:**

- Make sure that you extend the input $X$ by a column containing only ones to fit the intercept. You can use the following code given a matrix $X$: `X = np.hstack((np.ones((X.shape[0], 1)), X))`.
- You might find the `reshape` or `flatten` [function](https://saturncloud.io/blog/understanding-the-differences-between-numpy-reshape1-1-and-reshape1-1/) useful.

In [None]:
# Write your solution below

In [None]:
# Simulation 1
import numpy as np
np.random.seed(42)

# Generate random independent variables (500 samples, 3 features)
n_samples = 500
n_features = 3
X = np.random.rand(n_samples, n_features) * 5
y = np.dot(X, [3, -2, 2] ) + np.random.normal(loc=1.5, scale=1.0, size=n_samples)

beta_hat = linearRegression(X,y)
print(beta_hat)

In [None]:
# Simulation 2
import matplotlib.pyplot as plt
np.random.seed(42)

# Generate random independent variables (200 samples, 1 feature)
n_samples = 200
n_features = 1
X = np.random.rand(n_samples, n_features)
y = np.dot(X, [2.4] ) + np.random.normal(loc=0.5, scale=0.2, size=n_samples)

# Write your solution below