# <center>Class 3</center>

## User-defined functions (UDFs)

A function is a block of organized, reusable code that performs a single, related action. Functions make your application more modular and allow a high degree of code reuse.

You can define functions to provide the required functionality. Here are simple rules to define a function in Python.

* Function blocks begin with the keyword ```def``` followed by the function name and parentheses ```( )```.
* Any input parameters (arguments) are placed within these parentheses.
* The first statement of a function can be an optional statement - the documentation string of the function or docstring.
* The code block within every function starts with a colon (```:```) and is **indented**.
* The statement ```return``` [expression] returns a value, or a series of values, a list, a dictionary, .... A return statement with no arguments is the same as
```python
    return None
``` 
* Nevertheless, functions do not have to close with a `return` statement.

A simple function which does not return anything.

In [None]:
def print_my_name(name):
    print(f'This is my name: {name}.')

In [None]:
print_my_name('Peter')

In [None]:
print_my_name(1)

Functions, though, usually have some sort of return value. In addition, we can even have `type hints`. [Type hints](https://docs.python.org/3/library/typing.html) are just that: hints. The interpreter does not check the validity of the inputs so if the types are important you have to check them within your functions. Type hints are primarily for [readability and debugging](https://joshdimella.com/blog/python-typing-best-practices).

In [None]:
def divide_two_numbers(dividend: float, divisor: float) -> float:
    return dividend / divisor

In [None]:
divide_two_numbers(23, 5)

In [None]:
a = divide_two_numbers(23, 5)

In [None]:
print(a)

In [None]:
a
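
Since type hints are not enforced, a function that depends on its input types has to check them itself. The sketch below shows one possible way to do that with `isinstance`; the name `divide_two_numbers_checked` and the error message are assumptions made for illustration only.

```python
def divide_two_numbers_checked(dividend: float, divisor: float) -> float:
    """Divide two numbers after verifying that both inputs are numeric."""
    for value in (dividend, divisor):
        # bool is a subclass of int, so it is excluded explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f'Expected a number, got {type(value).__name__}.')
    return dividend / divisor


print(divide_two_numbers_checked(23, 5))

try:
    divide_two_numbers_checked('23', 5)  # a string slips past the type hint
except TypeError as error:
    print(error)
```

Without the check, the string input would still fail, but only later and with a less helpful message coming from the division itself.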

You can also add a default value to any of the inputs.

In [None]:
def divide_two_numbers(dividend: float, divisor: float = 2) -> float:
    return dividend / divisor

In [None]:
divide_two_numbers(10)

In [None]:
divide_two_numbers(10, 4)

Inputs, or *arguments*, can be passed in two ways:
- **positional arguments**: you only provide the input values when calling the function, all in the order of the function definition.

In [None]:
divide_two_numbers(20, 5)

- **keyword arguments**: you provide them using the argument names; in this case the order does not matter

In [None]:
divide_two_numbers(divisor = 5, dividend = 20) # the input order is switched but the function produces the right result

Additional considerations. 
1. You can return more than one object.
2. Get used to documenting your functions properly. For documentation standards please read [this article](https://www.datacamp.com/tutorial/docstrings-python). It used to be a pain for most developers but the good news is that most coding LLM services will take care of it for you. 

In [None]:
from typing import Tuple # our function returns two objects as a tuple

def divide_numbers(dividend: float, divisor: float) -> Tuple[float, float]:
    """
    Divide two numbers and return the quotient and remainder.

    Parameters:
    dividend (float): The number to be divided.
    divisor (float): The number by which to divide the dividend.

    Returns:
    Tuple[float, float]: A tuple containing the quotient and the remainder.
    
    Raises:
    ZeroDivisionError: If the divisor is zero.
    """
    quotient = dividend // divisor
    remainder = dividend % divisor
    
    return quotient, remainder
    

Once you have defined your function and added `docstrings` you can call the `help()` function to get information. 

Note: This docstring was generated through an LLM service.

In [None]:
help(divide_numbers)

In [None]:
q, r = divide_numbers(25, 3)
print(q, r)

In [None]:
q, r = divide_numbers(25, 0) # you can handle this using try - except

In [None]:
q, r = divide_numbers(10.2, 4.2)
print(q, r)
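
The division by zero above can be handled with `try` - `except`. The following is a minimal sketch: `divide_numbers` is repeated in shortened form so that the block runs on its own, and `safe_divide` is a hypothetical wrapper that returns `None` instead of raising.

```python
from typing import Optional, Tuple


def divide_numbers(dividend: float, divisor: float) -> Tuple[float, float]:
    """Return the quotient and the remainder, as in the function above."""
    return dividend // divisor, dividend % divisor


def safe_divide(dividend: float, divisor: float) -> Optional[Tuple[float, float]]:
    """Hypothetical wrapper: return None instead of raising on a zero divisor."""
    try:
        return divide_numbers(dividend, divisor)
    except ZeroDivisionError as error:
        print(f'Cannot divide {dividend} by {divisor}: {error}')
        return None


print(safe_divide(25, 3))  # (8, 1)
print(safe_divide(25, 0))  # prints the error message, then None
```

Whether to return `None`, return a default, or re-raise a more descriptive error depends on how the caller is expected to react.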

## Classes

This part of the course is for those who aspire to more than the very basics of Python. You can write good-quality, executable code without defining new classes but the more you get into Python the more you will find them useful. 

`Object-oriented programming` (`OOP`) is a programming paradigm that uses "*objects*" to design software. It is based on several key concepts that help organize code in a way that is modular, reusable, and easier to maintain. Here are the main principles of OOP:

1. **Classes and Objects**:
   - **Class**: A blueprint or template for creating objects. It defines a set of attributes (data) and methods (functions) that the created objects will have.
   - **Object**: An instance of a class. It represents a specific implementation of the class with its own unique data.

2. **Encapsulation**:
   - This principle involves bundling the data (attributes) and methods (functions) that operate on the data into a single unit, or class. It restricts direct access to some of the object's components, which can help prevent unintended interference and misuse of the data. Access to the data is typically controlled through public methods (getters and setters).

3. **Inheritance**:
   - Inheritance allows a new class (subclass or derived class) to inherit attributes and methods from an existing class (superclass or base class). This promotes code reusability and establishes a hierarchical relationship between classes.

4. **Polymorphism**:
   - Polymorphism allows methods to do different things based on the object they are acting upon, even if they share the same name. This can be achieved through method overriding (where a subclass provides a specific implementation of a method that is already defined in its superclass) and method overloading (where multiple methods have the same name but differ in parameters). Python does not support overloading by signature; default or variable-length arguments usually play that role.

5. **Abstraction**:
   - Abstraction is the concept of hiding the complex implementation details and showing only the essential features of the object. This simplifies the interaction with the object and reduces complexity.

OOP is widely used in many programming languages, including Python, Java, C++, and C#. It helps in building scalable and maintainable software systems by promoting a clear structure and organization of code.

**OOP vs  Procedural Programming**

- Procedural programming
    - code as a sequence of steps
    - great for data analysis and short scripts
- Object-oriented programming
    - code as *interactions* of objects
    - great for building frameworks and tools
    - *maintainable and reusable code*

Classes are the key features of object-oriented programming. A `class` is a structure for representing an object and the operations that can be performed on the object. 

A class is defined with the `class` keyword and defines the *class attributes* (variables) and the *class methods* (functions). Class names are defined with `CamelCase`, functions and attributes with `lower_snake_case` as a PEP8 convention.

- Each method of a class takes `self` as its first argument, a reference to the actual *instance* of the object.
- Some methods have special meaning.
    - `__init__()`: the `constructor` (initializer) that assigns the initial mandatory attributes at the moment when the object is created.
    - `__str__()`: defines the reader-friendly *string representation* of the object, used for instance when it is printed.
    - `__repr__()`: similar to `__str__()`, but it defines the unambiguous, printable representation that the interactive prompt shows; `print()` falls back to it when `__str__()` is not defined.
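
The difference between the two string methods is easiest to see side by side. The `Point` class below is a hypothetical example used only for this contrast.

```python
class Point:
    """Hypothetical class used only to contrast __str__ and __repr__."""

    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

    def __str__(self):
        # reader-friendly text, used by print() and str()
        return f'Point at ({self.x}, {self.y})'

    def __repr__(self):
        # unambiguous text, used by the interactive prompt and repr()
        return f'Point(x={self.x}, y={self.y})'


p = Point(1, 2)
print(str(p))   # Point at (1, 2)
print(repr(p))  # Point(x=1, y=2)
print([p])      # containers show the repr of their elements: [Point(x=1, y=2)]
```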



### Defining classes, class methods and class attributes

In [None]:
import math

In [None]:
class Triangle:
    
    def __init__(self, a: float, b: float, c: float):
        self.a = a
        self.b = b
        self.c = c
        self.area = self.calculate_area() # an instance attribute computed by the method defined below

    def calculate_area(self):
        s = (self.a + self.b + self.c) / 2
        area = math.sqrt(s * (s - self.a) * (s - self.b) * (s - self.c))
        return area

    def __str__(self):
        return ('The triangle has the following sides: {:,.2f}, {:,.2f}, {:,.2f}.'.format(self.a, self.b, self.c))

In [None]:
my_triangle = Triangle(3,4,5)

The cell above *instantiates* a Triangle object: **my_triangle** is an *instance* of the Triangle class. 

In [None]:
my_triangle.area

In [None]:
my_triangle.calculate_area()

In [None]:
print(my_triangle)

In [None]:
my_triangle

Note: you can use `import math` in the class definition as well, but it is generally recommended to place all import statements at the top of your Python file. This is a common practice for several reasons:

1. **Readability**: Having all imports at the top makes it easier for someone reading the code to see which modules are being used without having to search through the class definitions.

2. **Performance**: A module is loaded only once and then cached, but an import statement placed inside a method is still re-executed (as a cache lookup) on every call; importing at the top of the file avoids that repeated work.

3. **Convention**: Following the convention of placing imports at the top of the file aligns with the [PEP 8 style guide](https://peps.python.org/pep-0008/) for Python code, which promotes consistency and readability.

However, if you have a specific reason to import a module within a class (for example, if the import is only needed in that class and you want to limit the scope), you can do so. Just keep in mind that it may not be the best practice in most cases.

### Polymorphism

In [None]:
class Square:

    """
    A class to represent a square.

    Attributes
    ----------
    a : float
        The length of the side of the square.
    area : float
        The area of the square, calculated upon initialization.

    Methods
    -------
    calculate_area():
        Calculates and returns the area of the square.
    """
    
    def __init__(self, a: float):
        self.a = a
        self.area = self.calculate_area()

    def calculate_area(self):
        area = self.a * self.a
        return area

In [None]:
my_square = Square(7)
my_square.area

In [None]:
my_square.calculate_area()

In [None]:
# We did not define a string representation for a Square object. 
print(my_square)

In [None]:
my_square
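
Because different shape classes offer a method with the same name, code that uses them does not need to know which shape it is dealing with. A minimal sketch of this idea follows; `Square` is repeated in shortened form and `Circle` is a hypothetical extra shape that is not part of the class material.

```python
import math


class Square:
    def __init__(self, a: float):
        self.a = a

    def calculate_area(self) -> float:
        return self.a * self.a


class Circle:
    """Hypothetical extra shape, added only for this illustration."""

    def __init__(self, r: float):
        self.r = r

    def calculate_area(self) -> float:
        return math.pi * self.r ** 2


shapes = [Square(7), Circle(1)]
areas = [shape.calculate_area() for shape in shapes]  # same call, different behavior
print(areas)
```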

### Check the validity of object definition using an Exception

Create a custom Exception using `class inheritance` based on the `Exception` class.

In [None]:
class TriangleError(Exception): # now the interpreter knows that 'TriangleError' is an exception
    pass

We defined the *TriangleError* class as a child class of the *Exception* class.

In [None]:
class Triangle:
    
    def __init__(self, a: float, b: float, c: float):
        if a + b > c and b + c > a and c + a > b:
            self.a = a
            self.b = b
            self.c = c
            self.area = self.calculate_area()
        else:
            raise TriangleError('Invalid triangle sides: any side should be smaller than the sum of the other two.')

    def calculate_area(self):
        s = (self.a + self.b + self.c) / 2
        area = math.sqrt(s * (s - self.a) * (s - self.b) * (s - self.c))
        return area

    def __str__(self):
        return ('The triangle has the following sides: {:,.2f}, {:,.2f}, {:,.2f}.'.format(self.a, self.b, self.c))

In [None]:
your_triangle = Triangle(3, 4, 10)

We can even catch these user-defined errors.

In [None]:
a, b, c = 3, 4, 15

try: 
    your_triangle = Triangle(a, b, c)
except TriangleError:
    print('Invalid triangle, redefining triangle using the shortest side only.')
    side = min(a, b, c)
    your_triangle = Triangle(side, side, side)
    

In [None]:
print(your_triangle)

In [None]:
# the __dict__ attribute of an object contains its user-defined attributes and their values
your_triangle.__dict__

In [None]:
# returning all attributes and methods of an object (including magic methods and other general attributes added by Python)
dir(your_triangle)

### Comparison: `overloading` the comparison operators

- `__eq__()` is called when two objects are compared using `==`
- accepts two arguments: `self` and `other`
- returns a Boolean

In [None]:
class Rectangle:
    
    def __init__(self, a: float, b: float):
        self.a = a
        self.b = b
        self.area = self.calculate_area()

    def calculate_area(self):
        area = self.a * self.b
        return area

    def __eq__(self, other):
        return self.area == other.area

In [None]:
rectangle_1 = Rectangle(4,5)
rectangle_1.area

In [None]:
rectangle_2 = Rectangle(2,10)
rectangle_2.area

In [None]:
rectangle_1 == rectangle_2

Check out all the comparisons and other [special methods](https://docs.python.org/3/reference/datamodel.html#special-method-names) in the documentation.
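
The other comparison operators can be overloaded in the same way. The sketch below is one possible extension, not the only option: it uses `functools.total_ordering` from the standard library to derive the remaining operators from `__eq__` and `__lt__`, and it returns `NotImplemented` when the other object is not a rectangle. The class name `OrderedRectangle` is an assumption made for this example.

```python
from functools import total_ordering


@total_ordering  # fills in <=, > and >= from __eq__ and __lt__
class OrderedRectangle:
    def __init__(self, a: float, b: float):
        self.a = a
        self.b = b
        self.area = a * b

    def __eq__(self, other):
        if not isinstance(other, OrderedRectangle):
            return NotImplemented  # let Python handle comparisons with other types
        return self.area == other.area

    def __lt__(self, other):
        if not isinstance(other, OrderedRectangle):
            return NotImplemented
        return self.area < other.area


small, large = OrderedRectangle(2, 3), OrderedRectangle(4, 5)
print(small < large, small >= large, small == 6)  # True False False
```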

#### Class attributes

These are data shared among all instances of a class. We define them in the body of a class. 

In [None]:
class Employee:
    # Class attributes
    MIN_SALARY = 30_000

    def __init__(self, first_name: str, last_name: str, salary: float):
        self.first_name = first_name
        self.last_name = last_name
        if salary >= Employee.MIN_SALARY:
            self.salary = salary
        else:
            self.salary = Employee.MIN_SALARY

    # Redefine/overload how the object itself is printed to the console.
    def __repr__(self):
        return f"Employee('{self.last_name}, {self.first_name}', {self.salary})"

In [None]:
emp1 = Employee('Margaret', 'Mitchell', 25_000)
emp2 = Employee('Jean', 'Austen', 35_000)

In [None]:
emp1

In [None]:
emp2

In [None]:
emp2.salary
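
To see what "shared among all instances" means in practice, the following sketch changes a class attribute after two objects have been created. `Worker` is a hypothetical, stripped-down class used only here.

```python
class Worker:
    MIN_SALARY = 30_000  # class attribute, stored once on the class

    def __init__(self, name: str):
        self.name = name  # instance attribute, stored on each object


worker_a, worker_b = Worker('A'), Worker('B')

Worker.MIN_SALARY = 32_000  # changing it on the class is seen by every instance
print(worker_a.MIN_SALARY, worker_b.MIN_SALARY)  # 32000 32000

worker_a.MIN_SALARY = 1  # this creates an instance attribute that shadows the class one
print(worker_a.MIN_SALARY, worker_b.MIN_SALARY, Worker.MIN_SALARY)  # 1 32000 32000
```

The second assignment is a common source of confusion: it does not modify the class attribute, which is why the constructor above refers to `Employee.MIN_SALARY` explicitly.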

#### `Encapsulation`: private and public variables

In [None]:
# problem
emp1.salary = 25_000 # lower than MIN_SALARY
emp1

We use *underscores* and `decorators` to manage *data access*. For the role of underscores read [this article](https://www.datacamp.com/tutorial/role-underscore-python). Python decorators are a somewhat more complicated concept; a good place to start is [here](https://www.freecodecamp.org/news/python-decorators-explained-with-examples/).

In [None]:
class Employee:
    # Class attributes
    MIN_SALARY = 30_000

    def __init__(self, first_name: str, last_name: str, new_salary: float):
        self.first_name = first_name
        self.last_name = last_name
        if new_salary < Employee.MIN_SALARY:
            self._salary = Employee.MIN_SALARY # note the underscore in _salary
        else:
            self._salary = new_salary # salary now is an internal, 'protected' attribute

    @property 
    def salary(self): # the @property decorator on a method whose name is exactly the name of the restricted attribute
                        # returns the internal attribute
                        # it is defined as a function but behaves like an attribute
        return self._salary
        
    # Redefine/overload how the object is printed to the console.
    def __repr__(self):
        return f"Employee('{self.last_name}, {self.first_name}', {self._salary})"

In [None]:
emp3 = Employee('Virginia', 'Woolf', 50_000)

In [None]:
emp3

In [None]:
dir(emp3)

In [None]:
emp3.salary

In [None]:
emp3.salary = 32_000 # this won't work: 'salary' is a read-only property without a setter; '_salary' is the real attribute

You can add a `setter` method. 

In [None]:
class Employee:
    # Class attributes
    MIN_SALARY = 30_000

    def __init__(self, first_name: str, last_name: str, new_salary: float):
        self.first_name = first_name
        self.last_name = last_name
        if new_salary < Employee.MIN_SALARY:
            self._salary = Employee.MIN_SALARY
        else:
            self._salary = new_salary # salary now is an internal, 'protected' attribute

    @property 
    def salary(self): # the @property decorator on a method whose name is exactly the name of the restricted attribute
                        # returns the internal attribute
        return self._salary

    @salary.setter # now you can build additional checks in setting salary
    def salary(self, new_salary):
        if new_salary < Employee.MIN_SALARY:
            raise ValueError('Invalid salary.')
        else:
            self._salary = new_salary
        
    # Redefine/overload how the object is printed to the console.
    def __repr__(self):
        return f"Employee('{self.last_name}, {self.first_name}', {self._salary:,.0f})"

In [None]:
emp3 = Employee('Virginia', 'Woolf', 25_000)

In [None]:
emp3

In [None]:
emp3.salary = 25_000

In [None]:
emp3

In [None]:
emp3.salary = 42_000

In [None]:
emp3

Note: Python lets you overwrite an attribute even if it was computed by a method at instantiation. 

This makes Python less rigid regarding public and private variables than, for instance, Java, but the philosophy of Python is that users and developers are adults who will not mess up these values just for the heck of it. 

In [None]:
my_triangle.area = 25

In [None]:
my_triangle.area
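
If an attribute such as the area should never go out of sync with the sides, one option is to expose it as a read-only property, as was done with the salary. The sketch below assumes a hypothetical class name, `SafeTriangle`, and leaves out the validity check for brevity.

```python
import math


class SafeTriangle:
    def __init__(self, a: float, b: float, c: float):
        self.a = a
        self.b = b
        self.c = c

    @property
    def area(self) -> float:
        # recomputed from the current sides on every access (Heron's formula)
        s = (self.a + self.b + self.c) / 2
        return math.sqrt(s * (s - self.a) * (s - self.b) * (s - self.c))


t = SafeTriangle(3, 4, 5)
print(t.area)  # 6.0

try:
    t.area = 25  # no setter is defined, so this raises AttributeError
except AttributeError as error:
    print('Cannot set area:', error)

t.a, t.b, t.c = 6, 8, 10
print(t.area)  # 24.0
```

The trade-off is that the area is recalculated on each access instead of being stored once.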

## Numpy -  multidimensional data arrays
### Introduction

The `numpy` package (module) is used in almost all numerical computation using Python. It is a package that provides high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran so when calculations are vectorized (formulated with vectors and matrices), performance is very good. 

To use `numpy` you need to import the module, using for example:

In [None]:
import numpy as np

In the `numpy` package the terminology used for vectors, matrices and higher-dimensional data sets is *array*. 



### Creating `numpy` arrays

There are a number of ways to initialize new numpy arrays, for example from

* a Python list or tuples
* using functions that are dedicated to generating numpy arrays, such as `arange`, `linspace`, etc.
* reading data from files

#### From lists

For example, to create new vector and matrix arrays from Python lists we can use the `numpy.array` function.

In [None]:
# a vector: the argument to the array function is a Python list
v = np.array([1,2,3,4])
v

In [None]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M

The `v` and `M` objects are both of the type `ndarray` that the `numpy` module provides.

In [None]:
type(v), type(M)

The difference between the `v` and `M` arrays is their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

**Question**: Why is $v$ lower-case and $M$ upper-case?

In [None]:
v.shape

In [None]:
M.shape

The number of elements in the array is available through the `ndarray.size` property:

In [None]:
M.size

In [None]:
v.flatten()

In [None]:
M.flatten()

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [None]:
np.shape(M)

In [None]:
np.size(M)

We can also do simple mathematical operations on numpy arrays.

In [None]:
M * 4

In [None]:
M * 4 + 3

So far the `numpy.ndarray` looks awfully much like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? 

There are several reasons:

* Python lists are very general. They can contain any kinds of objects. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementing such functions for Python lists would not be very efficient because of the dynamic typing.
* Numpy arrays are **statically typed** and **homogeneous**. The type of the elements is determined when the array is created.
* Numpy arrays are **memory efficient and very fast**.
* Because of the static typing, mathematical functions such as multiplication and addition of `numpy` arrays can be implemented efficiently in a compiled language (C and Fortran are used).
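
A small comparison makes the first two points concrete. This is only an illustration of the different semantics, not a performance benchmark.

```python
import numpy as np

values = [1, 2, 3]
print(values * 2)            # list repetition: [1, 2, 3, 1, 2, 3]
print(np.array(values) * 2)  # element-wise multiplication: [2 4 6]

# a mixed list is converted to one common type when it becomes an array
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)           # float64
```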

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [None]:
M.dtype

We get an error if we try to assign a value of the wrong type to an element in a numpy array:

In [None]:
M[0,0] = "hello"

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [None]:
M = np.array([[1, 2], [3, 4]], dtype=complex)
M

In [None]:
M = np.array([[1, 2], [3, 4]], dtype=float)
M

In [None]:
M = np.array([[1, 2], [3, 4.0]])
M

Common data types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, `object`, etc.

We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128` (`float128` is only available on some platforms).

### Using array-generating functions

For larger arrays it is impractical to initialize the data manually, using explicit Python lists. Instead we can use one of the many functions in `numpy` that generate arrays of different forms. Some of the more common are:

#### arange & linspace

In [None]:
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step. Like the function range for lists!
x

In [None]:
x = np.arange(-1, 1, 0.1) # note that here we can use floats and non-integer steps; the built-in range does not allow this
x

In [None]:
# using linspace, both end points ARE included
np.linspace(0, 10, 25)

#### random data

In [None]:
from numpy import random # numpy also has its own set of random functions

Uniform random numbers in [0, 1)

In [None]:
np.random.rand(5,5)

<br>
 
Standard normally distributed random numbers with $\mu = 0$ and $\sigma^2=1$

In [None]:
np_a = np.random.randn(1000,1000)

In [None]:
np_a

In [None]:
np.mean(np_a) # this will be close to, but not equal to, zero

In [None]:
np.var(np_a)

<br>
 
Normally distributed random numbers with $\mu = 1$ and $\sigma^2=1$

In [None]:
np_b = np.random.randn(5, 5) + 1
np_b

In [None]:
np.mean(np_b)

In [None]:
np.var(np_b)

How do you make sure that noise will not make mean and variance meaningfully different from 1?

<br>
 
How do you generate an array of normally distributed random numbers where $\mu = 1$ and $\sigma^2=4$?

In [None]:
np_c = np.random.randn(1000,1000)*2 + 1

In [None]:
np.var(np_c)

In [None]:
np.mean(np_c)
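
The same distribution can also be drawn directly, without scaling and shifting by hand. The sketch below uses numpy's `default_rng` generator and its `normal` method; the seed value 42 is an arbitrary assumption that only serves to make the run reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is arbitrary

# loc is the mean, scale is the standard deviation (so the variance is 4)
sample = rng.normal(loc=1, scale=2, size=(1000, 1000))

print(sample.mean(), sample.var())
```

Note that `scale` takes the standard deviation, not the variance, which mirrors the multiplication by 2 in the cell above.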

### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array:

In [None]:
A = np.array([1,2,3,4,5])

It works in the same way as for **lists**. Refresh it (class 1)!

In [None]:
A[1:3]

Numpy arrays are **mutable**! 

In [None]:
A[1:3] = [-2,-3]
A

In [None]:
A[::] # lower, upper, step all take the default values

In [None]:
A[::2] # step is 2, lower and upper defaults to the beginning and end of the array

In [None]:
A[:3] # first three elements

In [None]:
A[3:] # elements from index 3

Negative indices count from the end of the array (positive indices from the beginning):

In [None]:
A[-1:]

In [None]:
A[-2:]

Index slicing works exactly the same way for multidimensional arrays:

In [None]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])

A

In [None]:
# a block from the original array
A[1:5, 1:3]

### Fancy indexing
Fancy indexing is the name for when an array or list is used in place of an index: 

In [None]:
row_indices = [1, 2, 3]
A[row_indices,:] # this selects the second, third and fourth row of A, and all its columns

In [None]:
A[row_indices] #this is equivalent to the expression above

In [None]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]

In [None]:
different_col_indices = [1, -1, 2] 
A[row_indices, different_col_indices]
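
When both a row list and a column list are given, numpy pairs them up element by element instead of returning a block. If the block is what you need, `np.ix_` is one way to get it. The array `A` is rebuilt here so that the example runs on its own.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
rows = [1, 2, 3]
cols = [1, 2, -1]

# paired indexing: picks the elements (1, 1), (2, 2) and (3, -1)
print(A[rows, cols])  # [11 22 34]

# np.ix_ combines every listed row with every listed column
block = A[np.ix_(rows, cols)]
print(block)
```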

### Data Processing With Numpy

In [None]:
import os

In [None]:
stockholm_weather = os.path.join(os.pardir, 'data', 'stockholm_daily_mean_temperature_1756_2017.txt')

In [None]:
data = np.loadtxt(stockholm_weather) 

In [None]:
data

In [None]:
data.shape

In [None]:
data.dtype

This is Stockholm weather data from 1756 through 2017. We only need the first four columns: year, month, day, average daily temperature. 

In [None]:
data = data[:, 0:4] # every row, columns from index 0 to index 3

In [None]:
data.shape

Elements in a numpy array always have the same data type.

In [None]:
data[0:10,]

For a better view we can call the `array_repr` function, which returns a string representation of the array.

In [None]:
np.array_repr(data[0:5,], suppress_small=True) # This will give back a string of the rows in the numpy array

In [None]:
for row in data[0:10,]:
    print(np.array_repr(row, suppress_small=True))

### Quick Stats

#### mean

In [None]:
# the temperature data is in column 3
np.mean(data[:,3])

The daily mean temperature in Stockholm between 1756 and 2017 has been about 6.1 °C.

In [None]:
# another way of getting the same mean value is calling the array's own method 
data[:,3].mean()

#### standard deviations and variance

In [None]:
np.std(data[:,3]), np.var(data[:,3])

#### min and max

*min()* and *max()*, together with many other statistical functions, are ***methods*** of every numpy array. It means that you can call them directly on your numpy object. 

Look at the minimum value below. What is this? How would you handle it?

In [None]:
# lowest daily average temperature
data[:,3].min()

In [None]:
# highest daily average temperature
data[:,3].max()

### Masking: selecting subsets of arrays
Masking is a kind of fancy indexing.

In [None]:
mask = (data[:, 0] == 1971)
mask

In [None]:
data[mask].shape

In [None]:
data[mask, 0] # years only

In [None]:
data[mask, 3] # temperatures only

In [None]:
print("The mean temperature in Stockholm in 1971 was " + str(np.mean(data[mask,3])))

In [None]:
print("The mean temperature in Stockholm in 1971 was {:.2f} degrees.".format(np.mean(data[mask,3])))

Get the unique values from an array

In [None]:
months = np.unique(data[:, 1]) # this will give us the months
months

### High-performance calculations

#### Quantiles

One of numpy's main advantages over Pandas (see next time) is its high-performance calculations. For instance, quantiles are resource-intensive calculations, but numpy handles them smoothly. 

In [None]:
np.percentile(data[:,3], 10) # 1st decile of daily temperatures in the dataset

#### Handling outlier data

If $min$ and/or $max$ values are obviously off any meaningful range, either because of some anomaly or because of data error, we may want to use quantiles to define the 'very low' or 'very high' values.

In [None]:
np.percentile(data[:,3], 50) # how do you interpret this number?

In [None]:
np.percentile(data[:,3], 0.1) # and this?

#### Substituting and dropping outlier data

Find the weird observation(s), where temperature is -999, in the dataset using masking. Why do we have these values in the dataset? Do they imply something?

In [None]:
mask = (data[:, 3] == -999)
data[mask]

Substituting anomalous data with NaN. Remember: numpy arrays are *mutable*!

In [None]:
data[mask, 3] = np.nan

In [None]:
data[mask]

In [None]:
type(np.nan)

The data makes more sense this way. 

In [None]:
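np.nanmin(data[:, 3]) # nanmin ignores NaN values; note that data[:3] would select the first three rows instead of column 3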

NaNs, however, make some other calculations, for instance percentiles, dysfunctional, so we'd better drop these observations.

In [None]:
data.shape

In [None]:
data = data[~ mask] # ~ negates the Boolean mask, so we keep the rows where mask is False

In [None]:
data.shape

Finding the `mode` of the distribution.

In [None]:
from scipy import stats # import the submodule explicitly; a bare 'import scipy' may not load it in older versions

In [None]:
stats.mode(data[:, 3])

#### Exercise

Iterate through the months, calculate and print out the number of the month, the first (D1) and the ninth decile (D9) of the temperatures for that particular  month.

### copy and "deep copy"

To achieve high performance, assignments in Python usually do not copy the underlying objects. This is important, for example, when objects are passed between functions, to avoid an excessive amount of memory copying when it is not necessary (technical term: pass by reference). 

In [None]:
A = np.array([[1, 2], [3, 4]])
A

In [None]:
# now B is referring to the same array data as A 
B = A
B

In [None]:
# changing B affects A
B[0,0] = 10
B

In [None]:
A

If we want to avoid this behavior and get a new, completely independent object `B` copied from `A`, we need to make a so-called **"deep copy"** using the function `copy`:

In [None]:
B = np.copy(A)

In [None]:
# now, if we modify B, A is not affected
B[0,0] = -5
B

In [None]:
A
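
The same caution applies to slices: basic slicing of a numpy array returns a *view* on the original data rather than a copy. The sketch below illustrates this, using `np.shares_memory` to confirm which objects share their data; the variable names are chosen for this example only.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])

view = A[0, :]  # basic slicing returns a view on the same data
view[0] = 99
print(A[0, 0])  # 99: the original changed as well

independent = A[0, :].copy()  # an explicit copy owns its data
independent[0] = -1
print(A[0, 0])  # still 99

print(np.shares_memory(A, view), np.shares_memory(A, independent))  # True False
```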

### Applying a function to a numpy array

In [None]:
theta = np.arange(-10, 10, 1)
theta

In [None]:
theta = theta.reshape(5,4)
theta

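When you apply a UDF written for single numbers, something weird happens: the comparison `x < 0` produces a whole array of Booleans, which `if` cannot reduce to a single True or False, so Python raises a `ValueError`.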

In [None]:
relu = lambda x: 0 if x < 0 else x
relu(theta)

You need to `vectorize` the function.

In [None]:
relu_v = np.vectorize(relu)

In [None]:
relu_v(theta)

Note: '*relu*' stands for '*rectified linear unit*' and it is widely used in neural networks.
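
`np.vectorize` is mainly a convenience: under the hood it still calls the Python function once per element. For a simple rule such as relu, numpy's own element-wise functions are an alternative worth knowing. The sketch below shows two of them, `np.where` and `np.maximum`; the variable names are assumptions made for this example.

```python
import numpy as np

theta = np.arange(-10, 10, 1).reshape(5, 4)

relu_where = np.where(theta < 0, 0, theta)  # element-wise if/else
relu_max = np.maximum(theta, 0)             # element-wise maximum with 0

print(np.array_equal(relu_where, relu_max))  # True
```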

## Linear algebra with Numpy

It goes beyond this course to delve into matrix algebra but here's a short example. 

In [None]:
M = np.array([[1,2], [3,4]])
v = np.arange(2)

In [None]:
M

In [None]:
print(M)

In [None]:
v

Define the matrix-vector (dot) product as

$$
\mathbf{v} \in \mathbb{R}^n, \quad \mathbf{M} \in \mathbb{R}^{n \times n} \implies \mathbf{M} \cdot \mathbf{v} \in \mathbb{R}^n$$


In [None]:
np.dot(M, v)

The other way around:

$$
 \mathbf{v} \cdot \mathbf{M} \in \mathbb{R}^n$$


In [None]:
np.dot(v, M)

In [None]:
v = np.arange(0, 5)

In [None]:
v

In [None]:
np.dot(v, v) # what does the dot product of a vector with itself equal?

Inverting a matrix

In [None]:
from numpy.linalg import inv

In [None]:
inv(M)

In [None]:
np_singular = np.array([[1,2], [2,4]])

In [None]:
np_singular

In [None]:
inv(np_singular)
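
A singular matrix has no inverse, so `inv` raises a `LinAlgError`. One way to deal with this is to catch the error, as in the sketch below; `try_inverse` is a hypothetical helper name, and returning `None` is just one possible convention.

```python
import numpy as np
from numpy.linalg import LinAlgError, inv


def try_inverse(matrix):
    """Hypothetical helper: return the inverse, or None if the matrix is singular."""
    try:
        return inv(matrix)
    except LinAlgError as error:
        print(f'Not invertible: {error}')
        return None


M = np.array([[1, 2], [3, 4]])
np_singular = np.array([[1, 2], [2, 4]])  # the second row is twice the first

print(try_inverse(M))
print(try_inverse(np_singular))
```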

<br> 
 
## Extra: Processing logs with Python

### A side note: regex

**Regular expressions**, called *regexes* for short, let you describe a pattern of text to search for. For example, `\d` in a regex stands for a digit character, that is, any single numeral from 0 to 9. 

In [None]:
import re

In [None]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [None]:
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group())

Regex
- is very complicated,
- is relatively slow,
- can be applied to various, completely unstructured texts.

A quick intro to regex with examples is here [http://automatetheboringstuff.com/2e/chapter7/](http://automatetheboringstuff.com/2e/chapter7/)
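
Patterns become more useful once parts of the match can be extracted separately. The following sketch extends the phone number example with named groups and `findall`; the group names `area` and `number` and the second phone number are made up for this illustration.

```python
import re

# parentheses define groups; (?P<name>...) gives a group a name
phone_regex = re.compile(r'(?P<area>\d{3})-(?P<number>\d{3}-\d{4})')

match = phone_regex.search('My number is 415-555-4242.')
if match:  # search returns None when nothing matches
    print(match.group('area'))    # 415
    print(match.group('number'))  # 555-4242

# findall returns one tuple of groups per match
found = phone_regex.findall('Call 415-555-4242 or 212-555-0000.')
print(found)
```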

### Semi-structured text

Logs are usually text messages with a limited set of row/sentence *schema*. This schema helps us process text in a way which is 
- relatively simple
- fast
- but only works with texts of known structures.

**Task: find the hosts related to  authentication failures.**

In [None]:
file = os.path.join(os.pardir, 'data', 'Linux_2k.log')
file

In [None]:
with open(file, 'r') as f:
    logtext = f.read()

In [None]:
logtext[:500]

In [None]:
logtext.split('\n')[0:10]

In [None]:
len(logtext)

In [None]:
for line in logtext.split('\n')[0:10]:
    if 'authentication failure' in line:
        print(line.split())

In [None]:
for line in logtext.split('\n')[0:10]:
    if 'authentication failure' in line:
        print(len(line.split(' ')))

In [None]:
for line in logtext.split('\n')[0:10]:
    if 'authentication failure' in line:
        print(line.split('rhost='))

In [None]:
for line in logtext.split('\n')[0:10]:
    if 'authentication failure' in line:
        print(line.split('rhost=')[1].split()[0])

### Exercise
- Find logs with 'authentication failure'
- Collect the 'rhost' values (host addresses) and list them together with the appropriate month, day, and hour values in the following way; each row should look like:
    - month day hour host_address

Much more on numpy is available on the numpy homepage: [https://numpy.org/](https://numpy.org/)

<details><summary><b>Click here for the solution</b></summary>
    
```python
with open(file, 'r') as f:
    for line in f:
        if 'authentication failure' in line:
            line = " ".join(line.split()) # get rid of all (including double) whitespaces and link each element with a single whitespace
            month = line.split(' ')[0]
            day = line.split(' ')[1]
            hour = line.split(' ')[2].split(':')[0]
            host = line.split('rhost=')[1].split(' ')[0]
            print(month, day, hour, host)
```

</details>