# Day 1: Morning Session

# Introduction

Welcome to the Python section of the Advanced Topics in Scientific Programming (ATSP) 2025 course!

We start this course with Python, and we will have two and a half days of Python in Machine Learning material, where we will discuss the following:

- Day 1 will focus on the basics of **Python**, **NumPy**, **Jupyter**, and then machine learning methods using **SciKit-Learn**
    - As I understand it, you have already done Pandas in the introduction course?
    - If we need a refresher on Pandas, we can do so quickly in this morning session now.
- Day 2 will focus on **Deep Learning**
- Day 3 will focus on **model deployment** and your **assignment** for the course
 
Your assignment will be to complete an end-to-end machine learning project - you will have the choice of several assignments. We will go into more detail regarding the assignment on Wednesday. One of the assignments will be an image classification assignment using deep neural networks, a second option focuses on developing a web application, and another option relates to a text or tabular dataset.

## Admin and Materials
Let's first discuss some administrative matters and the material for the course.

We provide the materials as follows: 

- All course materials are provided and presented as Jupyter notebooks
- Jupyter is a web-based development environment for Python, R, etc.
- During the seminars, we will use this server to run and execute the code live
- Each of you has an account so you can log in to the server, access the materials, and follow along while we code. You can also experiment with the code, etc.
- The server will only run during the seminar times. Outside of the seminars, you will be able to access the notebooks via our GitHub repository

This means that to run the notebooks outside of the seminar times, you will need to install and run Jupyter locally, or use something like Google Colab, which we discuss in detail tomorrow. Tomorrow afternoon, we will discuss your assignment, and in this session we can set up Jupyter on your local machines, etc.

### Day 1

Today we will discuss the following topics:

- What is Jupyter and how to use it
- Basics of Python
- Basics of NumPy
- Basics of Pandas (if required)
- Machine Learning basics
    - Classification, regression / Supervised vs. Unsupervised
    - Overfitting and underfitting
    - Model evaluation
- SciKit Learn
    - How to get data
    - How to generate data
    - Training, testing
    - Cross validation
    - Model persistence
    - Scaling and normalisation
    - Data preprocessing

We will split this content into the morning and afternoon sessions. 

So, in a bit more detail: first we discuss Jupyter and how to use it. We will use Jupyter almost exclusively in this course, both to cover the material and to run code examples, and you can follow along and try the examples yourself. Jupyter is a good skill to have if you work in data science, machine learning, or academic research in general. For example, it is very useful for reproducibility, for organising code, for presenting work, and for development work on a project. You can use it as an IDE and completely replace a text editor. It is web-based, meaning all you need is a browser, and it can be used remotely: for example, you can code at home while accessing a server at work or anywhere else in the world. It is worth noting that Jupyter is not only for Python development. There are so-called kernels for dozens of languages, including R, Julia, MATLAB, Mathematica, SPSS, you name it!

GitHub (which we will cover) also renders Jupyter notebooks natively, so it can be quite useful as a way to present your work for a paper, for example. These course materials are hosted on GitHub, and you can view these notebooks (statically) on GitHub without needing to install anything. We will look into this in detail later.

After covering Jupyter and how to use it, we cover the basics of Python. Python is a general purpose language, which makes it a very useful language to know. It is of course often used in the context of general Machine Learning tasks, but it completely dominates as the language of choice for Deep Learning. If you are doing any kind of Deep Neural Network training, you will most likely be using Python and frameworks such as Torch and TensorFlow. So in this section we will cover the basics of Python, such as functions, lists, classes, etc.

Once we have covered the basics of Python, then we will cover some tips and tricks for using Python in the context of Machine Learning, and cover some aspects of the language that make it useful for Data Science work in general.

Next, we cover NumPy, which is a framework for handling data in Python. It provides essential functionality for manipulating array and matrix data and is used in practically any machine learning project. If required, we can cover Pandas as well, although I understand that this has already been covered in the introductory course.

After we have covered the basics of Python programming, Jupyter, and NumPy, we will move on to Machine Learning methods using Python. The focus of this course is practical Machine Learning. We will not discuss in detail how the algorithms we cover work; instead, the focus is on how to train algorithms, what the best practices are, where to get data, etc., and we will mostly be using SciKit-Learn for this.

**Please note**: these notebooks are rather text-heavy: this is by design. The reason is that it should be possible for you to download, read, and go through all these notebooks completely independently after the course has finished. Hence they should be self-explanatory and useful as a reference later, if you wish to go back to a topic (for your assignment, for example).

### Day 2

Day 2 will focus on Neural Networks and Deep Learning, as well as model deployment.

Topics include:

- Simple neural networks: we will use PyTorch to define a simple neural network and train it
- PyTorch: we will discuss the framework we will use to create neural networks
- Deep learning: we move to deep networks, the basis for all of the recent advancements, such as generative models and GPT models
- Image classification: we will train an image classifier on a number of small tasks
- Image segmentation: we discuss image segmentation in the context of medicine
- Pre-trained models: use networks that have already been trained
- Fine-tuning models: adapt a pre-trained network to your specific task 
- Model deployment and web application development / developing interfaces to your models

### Day 3

Day 3 is the morning session only, and we will mostly focus on your assignment. We will have time for you to start your assignments, and answer any questions you might have about the assignment. 

### Exercises

Throughout the course, we will complete some exercises. This will give you a chance to try what you have learned. So, periodically you will work on some programming exercises, which should typically take 20 minutes or so, and afterwards we will go over the exercises. They will **not** be used for your grading! The grading for the course will be based on your assignment only! 

## Get Started

So, as mentioned, in this course we are going to be using Jupyter almost exclusively. We have set up a server so that you can follow along with the examples that we cover here.

The server can be accessed from:

- <http://learnlab.medunigraz.at>

**Note**: Do not worry if there is some warning about insecure HTTP: access to this server is limited to the MUG network and KAGes. No one else can access this server from outside of the MUG or the KAGes network.

To log in, use atsp + matriculation number, e.g. `at07316801` and the password that will be shown on the whiteboard. 

When you log in, you will see the notebooks for the course. If you do not see any notebooks, then something went wrong! Most likely you have entered an incorrect matriculation number. Let's make sure everyone is logged in before we continue.

The server is only accessible from within the MUG network and KAGes network. If you are working from home, it should be possible to log in via the MUG VPN, but I have not tested this.

The material (for example these notebooks, plus other material from the R part of the course, datasets, and so on) is all available under the following link on GitHub:

- <https://github.com/imigraz/ATSP-2025>

Let's have a look at the repository now. You can download the contents of the repository as a zip file, or you can clone the repository using Git.

---

# Jupyter

This environment that we are using now is called Jupyter. It provides an interactive web-based environment called notebooks. 

It is worth noting that Jupyter is not just for Python development. There are so-called kernels for dozens of languages, see <https://github.com/jupyter/jupyter/wiki/Jupyter-kernels>, including R. Also, Jupyter is used widely in academia to disseminate results and experiments. It is therefore a very useful tool to aid reproducibility. A particularly good example of this can be found here: <https://www.nature.com/articles/ng.3051> - you will find that there are a number of Jupyter notebooks associated with the paper, and they can be used to reproduce the exact same plots found within the paper itself: <https://github.com/theandygross/TCGA/>. If you supply your experiments as a Jupyter notebook, anyone can download the notebooks and run them on their local machine, confirming your results. This can greatly enhance your chances of getting a paper published!

GitHub natively supports Jupyter, so you can host notebooks for free there. Example: <https://github.com/mdbloice/Augmentor/blob/master/notebooks/Augmentor_Keras.ipynb>. GitHub is often used to host Jupyter notebooks for a publication, as we saw above. Note that GitHub hosts notebooks **statically**, meaning they cannot be executed. You can view them, and download them, but not run them. 

Back to the structure of a notebook. Notebooks can contain text, images, and also code. When you execute Python code, the code is sent to a server (which can be running on your local machine or elsewhere), and the results or output of the code appear in the notebook.

Therefore, there are different types of cells: text cells and code cells.

This is a text cell. Below we have a code cell:

In [None]:
print("A code cell is executed and the results are printed to the notebook!")

The returned result does not necessarily contain only textual or numerical data; it can also contain images or plots, of course.

## Jupyter Basic Usage

I will demonstrate the following now:

- Create new cells
- Making a cell active 
- Executing a cell
- Changing the cell type
- Placing a cell above and below the current position
- Hiding the output of a cell
- Moving a cell up or down

Also, the IDE itself allows you to use tabs, open terminals, view data, and so on. It is a proper IDE, from which you can do most of your work!

- Create a new notebook
- Browse notebooks
- Open a terminal
- Open a dataset

In the menu there are a few options to look at:

- **Run** menu
    - Run a cell
    - Run all cells
- **File** menu
    - Download
    - Convert
- **Kernel** menu
    - What is a kernel?
    - Interrupt kernel
    - Restart kernel

## Jupyter Tips and Tricks

Ok, so we have seen the basic usage, now let's discuss some more advanced features you may not know about, even if you use Jupyter a lot.

### Magics 

You can execute magics using `%` and `%%`. 

Line magics begin with a single `%` and they apply only to **that same line**.

Cell magics begin with double `%%` and apply to the **entire cell**.

#### Line Magics

For example, to see a history of your previous commands, use `%history`:

In [None]:
%history -l 10

The `-l 10` argument prints only the last 10 commands.

Another useful one is the `%whos` command, which lists all currently stored variables and some information about each one.

Let's create a few different data structures:

In [None]:
x = [1, 2, 3]
y = "hello"
z = 3.14

In [None]:
%whos

You can also display only certain types, for example only strings using `str` or only lists using `list`.

In [None]:
%whos list

In [None]:
%whos str

#### Cell Magics

These operate on the entire cell. 

A very useful one is the `%%time` cell magic:

In [None]:
%%time

a = [1,2,4]
b = 6

c = a * b

This tells you how long a cell takes to execute. An even more advanced version of this is `%%timeit`, which repeats the code many times and reports the average execution time. This is very useful if some background process might have affected the code's execution time, for example.

In [None]:
%%timeit

a = [1,2,4]
b = 6

c = a * b
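Outside a notebook, roughly the same measurement can be made with the standard library's `timeit` module. The sketch below is an illustration rather than part of the course code; the number of repetitions (`n_runs`) is an arbitrary choice.

```python
import timeit

# Time the same statement as in the cell above, repeated many times.
# The number of repetitions is an arbitrary choice for illustration.
n_runs = 100_000
total_seconds = timeit.timeit(
    "c = a * b",
    setup="a = [1, 2, 4]; b = 6",
    number=n_runs,
)

# Average time per execution, in seconds
average_seconds = total_seconds / n_runs
print(f"Average over {n_runs} runs: {average_seconds:.2e} seconds")
```

The exact numbers will differ from machine to machine and from run to run, which is why averaging over many repetitions is useful.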

With `%%writefile`, you can write the entire contents of the cell to a file on the disk:

In [None]:
%%writefile atsp.txt

hello, world

Use the `-a` parameter to append to a file:

In [None]:
%%writefile -a atsp.txt

welcome to the course

This is very useful for writing logs!
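The same append behaviour is available in plain Python through the `"a"` mode of `open()`. This is a minimal sketch, assuming a hypothetical file name `atsp_log.txt` and a hypothetical helper `append_log()`, neither of which is part of the course materials:

```python
import os
import tempfile


def append_log(path, message):
    """Append one line to the file at `path`, creating the file if needed."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")


# Write into a temporary directory so that no existing file is touched
log_path = os.path.join(tempfile.mkdtemp(), "atsp_log.txt")
append_log(log_path, "hello, world")
append_log(log_path, "welcome to the course")

# Read the file back to check that both lines were kept
with open(log_path, encoding="utf-8") as f:
    log_lines = f.read().splitlines()
print(log_lines)
```

Opening the file in `"w"` mode instead would overwrite the existing contents on every call, which is the behaviour of `%%writefile` without `-a`.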

To see a list of all magics, use `%magic`. Be warned, this produces a lot of output! However, we can use this to demonstrate how to hide the output of a cell.

In [None]:
%magic

#### Magic Commands Help

You can use `?` for help on one particular magic command. For example, for information on how to use the `%%writefile` magic, you would execute the following:

In [None]:
%%writefile?

### Executing Command Line Commands
You can execute commands on the command line using the exclamation mark `!` at the beginning of a line, followed by the command:

In [None]:
! ls

In [None]:
! df -h

### Documentation
We briefly saw how `?` can be used to get documentation for magic commands. However, it can also be used for any function or class.

Use `?` to access documentation of any function or class, by appending it to the function or class name.

For example, we can view the documentation of the Python function `open()`, which is used to open files on the disk, as follows:

In [None]:
open?

**Note**: you do not write the parentheses, just the name of the function. This will fail, for example:

In [None]:
open()?

### Autocompletion / Intellisense

Use `Tab` to autocomplete your code, such as showing you a list of available functions:

In [None]:
a = "some text"

In [None]:
a.capitalize()

Use `Shift + Tab` to bring up the documentation and the parameters for that function:

In [None]:
a.count()

## Keyboard Shortcuts

The full list of keyboard shortcuts can be found in the help menu, under Help -> Show Keyboard Shortcuts.

Some I use a lot:

- `Esc`: exits edit mode.
- `Enter`: enter edit mode
- `Shift` + `Enter`: execute a cell
- `A`: create a new cell **a**bove
- `B`: create a new cell **b**elow
- `D` + `D`: **d**elete current cell (hit `D` twice in quick succession)
- `Y`: Change cell to **code cell**
- `M`: Change cell to **text cell**
- `Shift` + `L`: show **l**ine numbers. This is useful for error messages.

All these options are available in the `Edit` menu above. For example, **Edit -> Delete Cell**.

When you execute a text cell, it **renders the cell**. You can double click on a rendered cell to view its formatting source. 

When you execute a code cell, it **executes the Python code** within the cell and shows its output.

### Markdown

As mentioned, text cells in Jupyter can be formatted using Markdown. 

Here is the most important Markdown formatting syntax:

Headings:
```
# Header 1
## Header 2
### Header 3
```

Text formatting:

```
Some **bold** text
Some *italics* text 
```

Lists:
```
- a list
- of unordered
- items

1. a list
2. of ordered
3. items

- [x] a list
- [ ] of todo
- [ ] items
```

Code:
```
A `code` sample
```
Blocks of code use 3 backticks.

Horizontal Rule:

```
---
```

Links:
```
[title](https://www.example.com)
```
or 
```
<https://example.com>
```

Images:
```
![Text Description](image.jpg)
```

See the following guide for a comprehensive list: <https://www.markdownguide.org/cheat-sheet/>

### Interactive Widgets

We will not cover this in much detail, however it is possible to run interactive widgets in Jupyter notebooks.

To do so we use a package called `ipywidgets`: see <https://ipywidgets.readthedocs.io/en/stable/> for more details. 

Using interactive elements you can have sliders embedded with the Jupyter notebook for example, where variables can be adjusted visually and so on:

In [None]:
import ipywidgets as widgets
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from IPython.display import display

# Interactive Function with Multiple Widgets
@widgets.interact
def plot_function(x=widgets.IntSlider(min=-10, max=10, step=1, value=0),
                  y=widgets.IntSlider(min=-10, max=10, step=1, value=0)):
    print(f"Coordinates: ({x}, {y})")

These variables can be used to adjust the parameters of a model for example, where the results of the new parameter changes are updated in the plot whenever you make any adjustments:

In [None]:
def plot_regression(num_points, noise_level, slope):
    plt.clf()  # Clear the plot first, done before each update.
    
    # Generate data
    x = np.linspace(0, 10, num_points)
    # Create true line with specified slope
    y_true = slope * x + 2  # Using 2 as a fixed intercept
    # Add random noise
    y = y_true + np.random.normal(0, noise_level, num_points)
    
    # Reshape x so it is compatible with SciKit-Learn
    X = x.reshape(-1, 1)
    
    # Fit linear regression
    reg = LinearRegression()
    reg.fit(X, y)
    y_pred = reg.predict(X)
    
    # Calculate R-squared
    r2 = reg.score(X, y)
    
    # Plotting
    plt.scatter(x, y, color='blue', alpha=0.5, label='Data points')
    plt.plot(x, y_true, 'g--', label='True line')
    plt.plot(x, y_pred, 'r-', label='Regression line')
    
    plt.grid(True)
    plt.title(f'Linear Regression (R² = {r2:.3f})')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.legend()
    
    # Print the regression equation
    equation = f'y = {reg.coef_[0]:.2f}x + {reg.intercept_:.2f}'
    plt.text(0.05, 0.95, equation, transform=plt.gca().transAxes)

# Create the widgets
points_slider = widgets.IntSlider(
    value=50,
    min=10,
    max=200,
    step=10,
    description='Points:',
    continuous_update=False  # Important!
)

noise_slider = widgets.FloatSlider(
    value=1.0,
    min=0.1,
    max=5.0,
    step=0.1,
    description='Noise:',
    continuous_update=False  # Important!
)

slope_slider = widgets.FloatSlider(
    value=2.0,
    min=-5.0,
    max=5.0,
    step=0.1,
    description='True Slope:',
    continuous_update=False  # Important!
)

# Create and display the interactive plot
plt.figure(figsize=(10, 6))

# What we have here is an interactive widget, which is being sent the plot
# to manipulate, the function plot_regression(), and its parameters are
# sent by querying the 3 sliders, points_slider, noise_slider, and slope_slider.
interactive_regression = widgets.interactive(
    plot_regression, 
    num_points=points_slider,
    noise_level=noise_slider,
    slope=slope_slider
)

display(interactive_regression)

**Note**: One important thing to consider when writing and running widgets is the `continuous_update` parameter. This should normally be set to `False`, so that the code only executes when you stop moving the slider. Otherwise the code executes for every intermediate value along the way, making it much slower!

## Exercise 1: Using Jupyter

In this exercise you will use some of the Jupyter specific functionality that we have covered.

Note that in the code cells, comments begin with `#`, therefore you can ignore those lines and do not need to delete them, if you do not want to.

### Exercise 1: Magic Commands

#### Question 1.1

Find out how to print the current working directory using a magic command.

In [None]:
# Place your answer here

%pwd

#### Question 1.2

Show the documentation for the command that you found for the answer above:

In [None]:
# Place your answer here

%pwd?

#### Question 1.3

What's the difference between `%time` and `%timeit` - use the documentation if you need help, using `?`.

Write your answer here.

#### Question 1.4

What is the difference between magic commands that start with `%` and magic commands that start with `%%`?

Write your answer here

# Basics of Python

A good guide: <https://swcarpentry.github.io/python-novice-gapminder/04-built-in.html>

- Variables
- Types
- Comments
- Built-in Functions
- Writing Functions
- Classes
- Modules

## What is Python

Python is a general purpose programming language. This differs somewhat from R which is much more focussed on statistics and data analysis. For example in Python you can write GUI applications, web applications, server-side applications, as well as use it for systems administration, machine learning, or statistics. It is a very broad language.

Therefore, Python has become probably the most popular language in the world, with the exception of perhaps JavaScript, which is used for web development (and is not as general purpose as Python).

We will see from the next section that it is a very simple programming language, with minimal syntax.

## Variables

You declare variables in much the same way as in R:

In [None]:
a = 2

Python is **dynamically typed**. This means in Python you do not need to declare which **types** you are using, meaning that you do not need to declare that the variable above is an integer.

In other languages you would need to say:

```java
// Java code
int age = 21;
String name = "Marcus";
```

What you are doing here is telling Java that `age` will contain an integer and `name` will contain a string.

In programming a **type** is the kind of variable. All programming languages will have types such as integers (whole numbers, -3, -2, -1, 0, 1, 2, 3), floats (real numbers, 3.14 or 0.00005), strings (text, "example"), collections such as **lists**, **dictionaries**, and **tuples**,  and others.

Being dynamically typed doesn't mean Python doesn't have types at all, it just means that they are automatically assigned based on the data. You can find a variable's type at any time using the `type()` function:

In [None]:
type(a)

In [None]:
type("Example")
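Because types belong to values rather than to variable names, the same name can hold values of different types over time. The following sketch illustrates this and also uses the built-in `isinstance()` function, which is not covered above, to test for a type. The variable name `value` is only for illustration:

```python
value = 21
print(type(value))   # <class 'int'>

value = "Marcus"     # The same name now refers to a string
print(type(value))   # <class 'str'>

# isinstance() checks whether a value is of a given type
is_text = isinstance(value, str)
is_whole_number = isinstance(value, int)
print(is_text)
print(is_whole_number)
```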

Therefore, if you try to add two variables of different types, Python will try to work out what you are doing:

In [None]:
a = 2
b = 4.5

print(a + b)

In [None]:
type(a)

In [None]:
type(b)

Some languages will throw an error for trying this, or at the very least a warning.

This automatic type conversion can only go so far, however, as the following demonstrates:

In [None]:
a = 2
b = "4.5"
print(2 + b)

The solution to this is to **cast** your variable:

In [None]:
a = 2
b = "4.5"

# Convert string b to float
b = float(b)

print(2 + b)
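Casting only works when the text actually represents a value of the target type, otherwise Python raises a `ValueError`. As a rough sketch, the built-in `float()`, `int()`, and `str()` functions behave as follows. The helper name `to_float_or_none()` is hypothetical and only for illustration:

```python
def to_float_or_none(text):
    """Return `text` converted to a float, or None if it is not a number."""
    try:
        return float(text)
    except ValueError:
        return None


print(to_float_or_none("4.5"))
print(to_float_or_none("four point five"))

# int() does not accept a string containing a decimal point directly,
# so convert to float first (the fractional part is then truncated)
whole = int(float("4.5"))
print(whole)

# After converting a number to a string, + joins text instead of adding
joined = str(2) + "4.5"
print(joined)
```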

You will use the `print()` function **a lot** to check the output of your commands, and to see what is stored in your variables and so on.

Here are some examples of `print()` in use.

Printing a string:

In [None]:
print("Hello, world")

Print multiple items, spaces are inserted automatically by default:

In [None]:
print("Hello", "ATSP", "class", "2025")

You can change the separator:

In [None]:
print("banana", "apple", "orange", sep=", ")

By default, `print()` prints a new line after each call:

In [None]:
print("Item 1")
print("Item 2")

You can change this using the `end` parameter:

In [None]:
print("Item 1", end=" | ")
print("Item 2")

You can print variables directly:

In [None]:
print(a)

To print multiple variables:

In [None]:
print(a, b)

You can print a combination of strings and variables using what is known as f-strings.

You define an f-string using `f` before the string. Then all variables contained in curly brackets `{}` are printed inline.

Here we want to print the contents of the variables `a` and `b`:

In [None]:
print(f"The variable a contains {a} and b contains {b}. The sum of these variables is {a+b}.")

Notice how you could perform the addition of a+b inline, as anything within the curly braces is executed.
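f-strings also accept a format specification after a colon inside the braces, which is handy for controlling decimal places. Here is a small sketch with made-up values (`accuracy` and `n_samples` are illustrative names, not results from the course):

```python
accuracy = 0.934567
n_samples = 12000

two_decimals = f"Accuracy: {accuracy:.2f}"   # Fixed number of decimal places
as_percent = f"Accuracy: {accuracy:.1%}"     # Multiply by 100 and add a % sign
with_commas = f"Samples: {n_samples:,}"      # Thousands separator

print(two_decimals)
print(as_percent)
print(with_commas)
```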

Python has a number of common operators, such as `+`, `-` and so on:

In [None]:
a = 8
b = 16

# Subtraction
c = b - a
print(c)

# Multiply
d = c*c
print(d)

Division using `/` always returns a float, even when both operands are integers and the result is a whole number:

In [None]:
15 / 5

By using `//` with integers you will instead get an integer, rounded down to the nearest integer (like a floor function):

In [None]:
10 // 3

Exponentiation is performed using `**`:

In [None]:
e = c**c
print(e)

Finally, modulus is `%`:

In [None]:
10 % 3

This returns the remainder of the division.
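Floor division and modulus fit together: for integers `a` and `b`, `a == (a // b) * b + (a % b)`. The built-in `divmod()` function, which is not used elsewhere in this notebook, returns both results at once. A short sketch with an illustrative value:

```python
total_minutes = 135

hours = total_minutes // 60    # Whole hours
minutes = total_minutes % 60   # Remaining minutes
print(f"{total_minutes} minutes is {hours} h {minutes} min")

# divmod() returns the quotient and the remainder together
quotient, remainder = divmod(total_minutes, 60)
print(quotient, remainder)

# Modulus is also a common way to test whether a number is even
is_even = total_minutes % 2 == 0
print(is_even)
```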

Python includes the standard comparison operators:

In [None]:
print(a > b)    # Greater than
print(a < b)    # Less than
print(a >= b)   # Greater than or equal
print(a <= b)   # Less than or equal
print(a == b)   # Equal to
print(a != b)   # Not equal to

Logical operators work on boolean values, in other words `True` or `False`: 

In [None]:
x = True
y = False
print(x and y)  # Logical AND
print(x or y)   # Logical OR
print(not x)    # Logical NOT

Assignment operator shortcuts are useful to save some extra characters of code.

In [None]:
c = 5   
c += 2  # Add and assign, same as saying c = c + 2
print(c)

The same add and assign shortcut can be used for `-`, `/`, and `*`:

In [None]:
c = 5
c -= 1  # Subtract and assign: c = c - 1
print(c)

c = 5
c *= 2  # Multiply and assign: c = c * 2 
print(c)

c = 5
c /= 3  # Divide and assign: c = c / 3
print(c)

---

## Collections

There are 4 main collection types in Python, and you will come across all of them in Machine Learning tasks. 

You can think of collections as lists of items, and each of the 4 collection types has its own distinct properties and characteristics.

The 4 collection types are:

- Lists
- Dictionaries
- Sets
- Tuples

**Note**: There are actually many more than these 4 collection types in Python, see the `collections` module (<https://docs.python.org/3/library/collections.html>), which contains some more specialised container types, such as `namedtuple` and `OrderedDict`. These are typically used in very specific scenarios, however.

### Lists

Lists are unsurprisingly lists of objects. They can be numerical or text, or mixed. 

- Lists are **mutable**, that means they can be changed after they have been declared. 
- Lists are **ordered**, meaning their order is defined, which means they will maintain the order in which you added the elements, unless you explicitly change this. 
- You create lists by placing items inside **square brackets** `[]`, separated by commas.
- Lists **allow duplicates**: they can contain multiple occurrences of the same element.

Use lists when you have a collection of items where order matters and you might need to change the items (add, remove, or change - i.e. mutable).

Below we will demonstrate some examples of using lists.

Creating lists can be done as follows:

In [None]:
numbers = [1, 2, 3, 4, 5]
print(numbers)

mixed = [1, "hello", 3.14, True]
print(mixed)

empty = []
print(empty)

If we take a look at the `numbers` list, we can see how to add items:

In [None]:
# Add to the end
numbers.append(6)
print(numbers)

# Add to the beginning, at index 0
numbers.insert(0, 0)
print(numbers)

# Add multiple items using another list with extend()
numbers.extend([7, 8, 9])  
print(numbers)

Do not use `append()` to add multiple items to a list! This will add the list to the list as a single item... use `extend()` instead.
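The difference is easiest to see side by side. In this sketch the list names are chosen only for illustration:

```python
with_append = [1, 2, 3]
with_append.append([4, 5])   # Adds the whole list as ONE element
print(with_append)           # [1, 2, 3, [4, 5]]
print(len(with_append))      # 4

with_extend = [1, 2, 3]
with_extend.extend([4, 5])   # Adds each item individually
print(with_extend)           # [1, 2, 3, 4, 5]
print(len(with_extend))      # 5
```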

Removing an element can be done using `remove()`. This will remove the **first occurrence** of that element:

In [None]:
numbers.remove(0)
print(numbers)

**Note**: removing **all occurrences** of an element requires a different approach, for example a Python construct called a **list comprehension**:

In [None]:
numbers_repeated = [1, 2, 2, 2, 2, 2, 3, 4, 5, 6]

[i for i in numbers_repeated if i != 2]

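A **list comprehension** is basically a for loop that can be written in one line. There is no built-in list method that removes every occurrence of a value (`remove()` only removes the first one), so filtering with a loop or a comprehension is the usual way to do it. Note that the comprehension returns a new list and leaves `numbers_repeated` unchanged.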

We will cover comprehensions in more detail later, after we have learned about loops in general. 
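To make the connection to a loop concrete, here is a sketch of the same filtering written both ways. Both versions build a new list and leave the original untouched; the result names are only for illustration:

```python
numbers_repeated = [1, 2, 2, 2, 2, 2, 3, 4, 5, 6]

# Written as an explicit for loop
without_twos_loop = []
for i in numbers_repeated:
    if i != 2:
        without_twos_loop.append(i)

# Written as a list comprehension
without_twos = [i for i in numbers_repeated if i != 2]

print(without_twos_loop)
print(without_twos)
print(numbers_repeated)  # The original list still contains the 2s
```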

One thing you will use a lot in Python in Data Science or Machine Learning is indexing your data stored in arrays.

You do this using square brackets `[]`, and indices are 0-based.

Here are some examples:

In [None]:
print(numbers)

first = numbers[0]  # First element
print(first)

last = numbers[-1]  # Last element
print(last)

second_last = numbers[-2]
print(second_last)

subset = numbers[1:4]  # Slicing: starts at 1 and ends BEFORE 4.
print(subset)

Above, we saw an example of slicing. One thing that occurs a lot in using Python for Data Science, or Machine Learning is selecting subsets of your data. Later we will go over this in detail, and how to perform complex slicing of your arrays (and 2D and 3D matrices).
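Slices can also omit the start or the end, and take an optional step as a third value. A brief sketch on a fresh list (named `letters` here so that it does not interfere with `numbers`):

```python
letters = ["a", "b", "c", "d", "e", "f"]

print(letters[:3])    # From the start up to, but not including, index 3
print(letters[3:])    # From index 3 to the end
print(letters[::2])   # Every second element
print(letters[::-1])  # A reversed copy of the list
print(letters[-2:])   # The last two elements
```

Each slice returns a new list, so `letters` itself is not modified by any of these lines.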

You can modify elements using `=`. Lists are **mutable**, meaning they can be changed after they are created (unlike tuples, for example, which we will look at later).

Here are some examples of modifying a list:

In [None]:
numbers[0] = 10  # Change first element, at index 0
print(numbers)

numbers[1:4] = [20, 30, 40]  # Change multiple elements
print(numbers)

Operations can be performed on lists using `*` and `+`, like you can do on integers. 

For example:

In [None]:
doubled = numbers * 2 
print(doubled)

combined = numbers + [1000, 2000, 3000]
print(combined)

You might think that `*` would multiply each element of the list instead. For plain lists it does not, but this is how NumPy arrays behave, and we will get to this later.
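For plain lists, multiplying each element requires a loop or a list comprehension. A minimal sketch using an illustrative list `values`:

```python
values = [1, 2, 3]

repeated = values * 2                  # Repeats the whole list
print(repeated)                        # [1, 2, 3, 1, 2, 3]

multiplied = [v * 2 for v in values]   # Multiplies each element
print(multiplied)                      # [2, 4, 6]
```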

Sorting is possible, some examples are below:

In [None]:
numbers.sort()  # This sorts in-place
print(numbers)

# If you do not want this done in place, use sorted(), which returns a new list
nums = [7, 2, 19]
sorted_nums = sorted(nums)
print(sorted_nums)  # The sorted copy
print(nums)         # The original list is unchanged

# Sort in descending order, in place
numbers.sort(reverse=True)
print(numbers)

# If you do not want in place, sorted() also accepts reverse=True
sorted_desc = sorted(numbers, reverse=True)
print(sorted_desc)

The functions `len`, `min`, and `max` can be used with lists:

In [None]:
length = len(numbers)  # Number of elements in list
print(length)

# Max and min
maximum = max(numbers) 
print(maximum)

minimum = min(numbers) 
print(minimum)

Removing items can be done using the `remove()` function (which we saw already above), but also with the `pop()` function or with the `del` keyword:

In [None]:
numbers

In [None]:
numbers = [1, 45, 90, 6, 17, 88, 22]
print(f"Numbers: {numbers}")

popped = numbers.pop()  # Remove and return the LAST element of the list
print(f"Removed {popped}, leaving {numbers}")

# Remove and return element at index 2
popped_index = numbers.pop(2)  
print(f"Removed {popped_index}, leaving {numbers}")

# Delete element at index 1
del numbers[1]  
print(f"Removed index 1, leaving {numbers}")  # Note we do not know which one was removed! Using pop() is often better.

# Remove all elements, useful if you have a very large list in memory
numbers.clear()
print(numbers)

Lists are used in Data Science and Machine Learning all the time; however, for performance reasons we will often use NumPy arrays instead. We will cover these types of arrays later.

### Sets

Sets are like mathematical sets. Here are some of the properties of sets:

- **Mutable**: You can add or remove elements from a set
- **Unordered**: Sets do not record element position or order of insertion. You might ask why? This can make them very fast, and they are useful for containing many millions of elements.
- You create sets by placing items inside **curly brackets** `{}`, or using the `set()` function.
- **Duplicates are not allowed**: A set cannot contain multiple occurrences of the same element.

You can use sets when you need to ensure that an element only appears once in a collection, often used for membership testing (e.g. storing usernames, email addresses). Also, they are used for mathematical set operations, such as intersection, union, and difference. 

In [None]:
# Creating sets
numbers = {1, 2, 3, 4, 5}
print(numbers)

fruits = set(['apple', 'banana', 'cherry'])
print(fruits)

empty = set()  # Can't use {} as that creates an empty dict
print(empty)

**Tip**: you can easily remove duplicates from a list by converting a list into a set, and this works very quickly even for very long lists:

In [None]:
numbers = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]

# Convert to set to remove duplicates:
unique_numbers = set(numbers)
print(unique_numbers)

# This also works for string based lists
fruits = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']
unique_fruits = set(fruits)  
print(unique_fruits)
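
One caveat with this tip: because sets are unordered, the original order of the list is not guaranteed to survive the conversion. If the order matters, one common alternative (not part of the example above) is `dict.fromkeys()`, which keeps the first occurrence of each item in its original position. A minimal sketch:

```python
fruits = ['apple', 'banana', 'apple', 'cherry', 'banana', 'apple']

# Dictionary keys are unique and keep their insertion order,
# so this removes duplicates while preserving the original order
unique_in_order = list(dict.fromkeys(fruits))
print(unique_in_order)  # ['apple', 'banana', 'cherry']
```

Use `set()` when you only care about which items exist, and this approach when the order of first appearance matters.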

Mathematical union can be done using the `|` operator:

In [None]:
set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

# Union
union = set1 | set2  # Using operator
print(union)

Intersection using `&`:

In [None]:
intersection = set1 & set2
print(intersection)

Difference using `-`:

In [None]:
difference = set1 - set2
print(difference)

# Here the order matters:
difference = set2 - set1
print(difference)

The symmetric difference using `^` (elements in either set, but not in both):

In [None]:
symmetric_diff = set1 ^ set2 
print(symmetric_diff)

Checking membership:

In [None]:
is_in = 1 in set1
not_in = 1 not in set1

print(is_in)
print(not_in)


Compare sets using `<=` and `>=` for subset and superset:

In [None]:
# Is set1 a subset of set2?
is_subset = set1 <= set2  
print(is_subset)

# Is set1 a superset of set2?
is_superset = set1 >= set2 
print(is_superset)

Check if two sets have **no elements in common** using `isdisjoint()`:

In [None]:
is_disjoint = set1.isdisjoint(set2)
print(is_disjoint)

In [None]:
{1,2,3}.isdisjoint({4,5,6})

### Dictionaries

Dictionaries work differently in that they are arranged in key-value pairs. We will see dictionaries often later.

- **Mutable**: You can change, add, or delete key-value pairs after the dictionary is created.
- **Ordered**: Dictionaries preserve insertion order (since Python 3.7). Generally this does not matter, as dictionaries are normally accessed via their keys.
- Syntax: Created with **curly brackets** {} containing **key/value pairs** separated by colons, in the form `{'key': 'value'}`. Keys are often strings, but can be integers or even floats. Values can be any type.
- **Unique Keys**: The **keys** have to be unique, but values can be duplicated.

Usage: Use dictionaries when you need to associate a unique key with a value.

Here is a simple dictionary, you can see that it is created using key-value pairs:

In [None]:
user = {
    'name': 'marcus', 
    'age': 21, 
    'city': 'limerick'
}

print(user)

You can access an element of a dictionary using the key:

In [None]:
user['name']

However, a safer approach is using the `get()` function:

In [None]:
user.get('name')

Why is `get()` safer? It allows you to set a default value if the key is not found:

In [None]:
user.get('citizenship', 'UNKNOWN')

Using `[]` with a missing key raises a `KeyError`, which you would have to handle:

In [None]:
user['citizenship']
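
If you do want to use `[]`, handling the error could look something like the sketch below. It uses `try`/`except`, which we do not cover in detail in this course, and the fallback value `'UNKNOWN'` is just an assumption for illustration:

```python
user = {'name': 'marcus', 'age': 21, 'city': 'limerick'}

try:
    citizenship = user['citizenship']
except KeyError:
    # The key does not exist, so fall back to a default value
    citizenship = 'UNKNOWN'

print(citizenship)  # UNKNOWN
```

This does the same job as `user.get('citizenship', 'UNKNOWN')`, which is why `get()` is usually the more convenient choice.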

Typically, you would create a dictionary per user, and store these dictionaries in a list, such as:

In [None]:
users = [
    {'name': 'marcus', 'age': 21}, 
    {'name': 'matilda', 'age': 4},
    {'name': 'theo', 'age': 1}
]

And access each item by looping over the list:

In [None]:
for person in users:
    print(person['name'])

You can add and update fields as follows:

In [None]:
# Adding new key-value pair
user['email'] = 'marcus@example.com'
print(user)

# Updating an existing value
user['age'] = 44  
print(user)

If you want to remove items, then use `del` or `pop()`:

In [None]:
del user['age']
print(user)

# Remove an item and return it
removed_value = user.pop('email')
print(user)
print(removed_value)

To check if a dictionary has a particular **key**, use `in`:

In [None]:
if 'name' in user:
    print("The 'name' key exists in the dictionary.")

To access the data, we use the functions `keys()`, `values()`, and `items()`:

In [None]:
keys = user.keys()
print(keys)

values = user.values()
print(values)

items = user.items()  
print(items)  # A view of the (key, value) pairs, not a list

You can loop through the dictionary in many ways, here is one example to print each key/value pair in a dictionary:

In [None]:
for key in user:
    print(key, user[key])

Or using the `items()` function, which returns each key/value pair in the dictionary:

In [None]:
for key, value in user.items():
    print(key, value)

If you only care about the values, you can use the `values()` function:

In [None]:
for value in user.values():
    print(value)

If you want to merge two dictionaries, use the `update()` function, which modifies the first dictionary in place:

In [None]:
# Merging dictionaries
dict1 = {'name': 'marcus', 'age': 44}
dict2 = {'city': 'limerick', 'salary': 1575.55}
dict1.update(dict2)
print(dict1)
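
If you would rather keep both original dictionaries unchanged, Python 3.9 and later also provide the `|` operator for dictionaries. The following is a small sketch, assuming a recent Python version; the overlapping `'age'` key is added here only to show which value wins:

```python
dict1 = {'name': 'marcus', 'age': 44}
dict2 = {'city': 'limerick', 'age': 45}

# The | operator returns a new dictionary and leaves both inputs unchanged.
# Where a key appears in both, the value from the right-hand side wins.
merged = dict1 | dict2

print(merged)  # {'name': 'marcus', 'age': 45, 'city': 'limerick'}
print(dict1)   # {'name': 'marcus', 'age': 44}
```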

### Tuples

Tuples are **immutable** (unchangeable), ordered, and can contain duplicate items.

- **Immutable**: Once a tuple is created, you cannot change it, cannot add elements to it, or remove elements from it
- **Ordered**: The items have a defined order
- Syntax: Created by placing items inside **parentheses** `()`, separated by commas
- **Duplicates are allowed**: Tuples can contain multiple occurrences of the same element

Usage: Use tuples when you have an ordered collection of items that should never change! 

Creating tuples is done with round brackets/parentheses `()`:

In [None]:
numbers = (1, 2, 3, 4, 5)
print(numbers)

# Creating a single element tuple needs a comma:
single_element = (1,)
print(single_element)

# Empty tuple:
empty = ()
print(empty)

# Mixed type tuple:
mixed = (1, "hello", 3.14, True)
print(mixed)

You can access elements much like lists:

In [None]:
# First element
first = numbers[0]  
print(first)

# Last element
last = numbers[-1]
print(last)

# Slicing: elements at index 1, 2, and 3
subset = numbers[1:4] 
print(subset)

We will look at so-called 'slicing' of data structures in much more detail later.

You can count elements easily using `count()`:

In [None]:
pi = (3, 1, 4, 1, 5, 9)
pi.count(1)

And check for membership using `in`:

In [None]:
exists = 1 in pi
print(exists)

If you want a list just use the `list()` function:

In [None]:
pi_list = list(pi)
type(pi_list)

You can use `min()`, `max()`, and `len()` with tuples:

In [None]:
length = len(pi)
print(length)

# Max and min
maximum = max(pi)
print(maximum)

minimum = min(pi)
print(minimum)

You can sort a tuple using the `sorted()` function, but it returns a new list, which you need to convert back in to a new tuple:

In [None]:
sorted_tuple = tuple(sorted(pi))
print(sorted_tuple)

This is because tuples are immutable, so they cannot be sorted directly.
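
To see what immutability means in practice, the short sketch below tries to change an element of a tuple. Python refuses with a `TypeError`, so the usual approach is to build a new tuple instead:

```python
pi = (3, 1, 4, 1, 5, 9)

try:
    pi[0] = 100  # Tuples do not support item assignment
except TypeError as error:
    print(f"Not allowed: {error}")

# To "change" a tuple, build a new one from the pieces you want
updated = (100,) + pi[1:]
print(updated)  # (100, 1, 4, 1, 5, 9)
print(pi)       # (3, 1, 4, 1, 5, 9)
```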

### Exercise 2: Collections and Data Structures

#### Question 2.1

Create a **list** of students, where each student is a **dictionary** within that list. 

The dictionary should contain their name, age, and matriculation number.

In [None]:
# Your answer here


#### Question 2.2

Find the second largest number in the `exercise_list1` list below.

In [None]:
exercise_list1 = [3, 55, 11, 109, 60, 90, 3, 27]

# Add your code here:
sorted_list = sorted(exercise_list1)
sorted_list[-2]

#### Question 2.3

Create a list of colours called `colours`, with at least 3 colours. 

Then add a colour to this list, and print the list:

In [None]:
colours = ['red', 'green', 'blue']

colours.append('purple')

print(colours)

Now, remove the first element of the list:

In [None]:
colours.remove('red')
colours

Check if the colour turquoise is in the `colours` list. Hint: the `in` keyword is what you want to use. This can be done in one line.

In [None]:
'turquoise' in colours

## Control Flow

Python, like any other language, uses statements such as `if` and `else` to control the flow of a programme. 

In this course, we will only really be using `if`, `else` statements, and `for` loops. Therefore we will cover these now. 

We will use `if`/`else` statements to control the flow of the code, and we will mostly use `for` to loop over lists, sets, and so on. 

You should all be familiar with `if`, `else`, `elif`, and so on from the introductory course:

In [None]:
age = 18

if age >= 18:
    print("You are an adult")

The comparison operator `>=` means greater than or equal to.

An `else` statement lets you handle the case where the `if` condition is false:

In [None]:
age = 16

if age >= 18:
    print("You are an adult")
else:
    print("You are a minor")

The `elif` statement means "else if" and is used as follows:

In [None]:
score = 85

if score >= 90:
    print("A grade")
elif score >= 80:
    print("B grade")
elif score >= 70:
    print("C grade")
elif score >= 60:
    print("D grade")
else:
    print("F grade")

Python tests each condition in order and runs only the first branch whose condition is true. If none of the conditions are true, the `else` branch runs.

You will often see multiple conditions on one line, combined using logical operators such as `and` and `or`. 

For example:

In [None]:
age = 25
income = 50_000 

# AND operator - both conditions must be true
if age > 18 and income > 30_000:
    print("Loan approved")

The `or` operator means at least one condition must be true:

In [None]:
# OR operator - at least one condition must be true
if age < 18 or income < 10000:
    print("Loan denied")

## Loops

The `for` loop is the most common type of loop in Python. It simply iterates over all the elements of a collection, such as a list. We will see this many times during the seminar.

The most common usage of `for` is with `in`, as follows:

In [None]:
fruits = ["apple", "banana", "cherry"]

for fruit in fruits:
    print(fruit)

Imagine you have a list of colours: `["red", "blue", "green", "orange"]` and you wanted to loop over it.

### The `reversed()` function

How would you iterate over a list backwards?

We just have to use the built-in `reversed()` function:

In [None]:
for colour in reversed(colours):
    print(colour)

### Looping in sorted order

You can use the `sorted()` function to loop in sorted order: 

In [None]:
for colour in sorted(colours):
    print(colour)

What if you wanted reverse order? You can pass the argument `reverse=True`:

In [None]:
for colour in sorted(colours, reverse=True):
    print(colour)

### Looping over a Dictionary

So far we have been looping over lists. It is also very easy to loop over a dictionary in Python. 

If you use `for` and `in` on a dictionary you loop over its keys.

Imagine you had the following dictionary:

In [None]:
favourite_colours = {'matilda': 'red', 
                     'theo': 'green', 
                     'liam': 'yellow'}

In [None]:
for key in favourite_colours:
    print(key)

Notice that the loop variable `key` holds each key in turn. This is often what you want, as you can use the key to access the corresponding value within the loop:

In [None]:
for key in favourite_colours:
    print(favourite_colours[key])

If you wish to access the keys and values, you can use the `items()` function:

In [None]:
for key, value in favourite_colours.items():
    print(key, value)

### Looping with indices

Notice when you loop using `for` and `in`, you do not get the indices of the list. Sometimes you may want that.

One way to do this would be to say:

In [None]:
colours = ['red', 'green', 'yellow', 'orange', 'purple']

for i in range(len(colours)):
    print(i, " --> ", colours[i])

This is the C way. In Python just use `enumerate`:

In [None]:
for i, colour in enumerate(colours):
    print(i, " --> ", colour)

You will see the `enumerate()` function often in Python when using it for Machine Learning and Data Science work. 

Basically, `enumerate()` iterates over a list and returns each item together with its index.
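
`enumerate()` also accepts an optional `start` argument, which is handy when you want human-friendly numbering that begins at 1. A short sketch, using a shortened list of colours for illustration:

```python
colours = ['red', 'green', 'yellow']

# start=1 makes the counter begin at 1 instead of 0
numbered = list(enumerate(colours, start=1))
print(numbered)  # [(1, 'red'), (2, 'green'), (3, 'yellow')]

for position, colour in enumerate(colours, start=1):
    print(position, " --> ", colour)
```

Note that `start` only changes the counter. It does not skip any items in the list.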

#### Question 2.4

Given the list `input_numbers` below, print only the even numbers. 

The modulus operator `%` can be used, for example.

Use your knowledge of `for` loops using `in` and your knowledge of `if` statements. 

In [None]:
input_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for i in input_numbers:
    if i % 2 == 0:
        print(i)

## Functions

Functions are very similar to those in R or other languages. 

You define a new function using `def`, name the function (in this case `add_numbers()`), define its parameters (in this case `a` and `b`), and define what it returns (in this case the sum of `a` and `b`, which is stored in `answer`):

In [None]:
def add_numbers(a, b):
    answer = a + b
    
    return answer

The function is now defined. It will not execute until you call it. 

We shall do so now and add the numbers 5 and 10:

In [None]:
add_numbers(5, 10)

We will not go into more detail than that. If you know how to define a function and call it, this is sufficient knowledge for this course.

#### Question 2.5

Write a function that returns a list in reverse sorted order.

In [None]:
# Write your code here

### The `zip()` function

Imagine you had two lists:

In [None]:
names = ["Matilda", "Theo", "Liam"]
colours = ['red', 'green', 'yellow', 'orange', 'purple']

And you want to loop over both lists, pairwise, one at a time. How would you do this?

You could say:

In [None]:
n = min(len(names), len(colours))

for i in range(n):
    print(names[i], " --> ", colours[i])

This works, but in Python there is of course a better way, using the built-in `zip()` function:

In [None]:
for name, colour in zip(names, colours):
    print(name, " --> ", colour)

If the lists are different lengths, `zip()` stops at the end of the shortest list:

In [None]:
list(zip(['a','b','c'], [1,2]))
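
If you would rather keep the leftover items from the longer list, the standard library offers `itertools.zip_longest()`, which pads the shorter list with a fill value. This is a minimal sketch; the variable names are made up for illustration:

```python
from itertools import zip_longest

letters = ['a', 'b', 'c']
numbers = [1, 2]

# Missing values from the shorter list are replaced with fillvalue
pairs = list(zip_longest(letters, numbers, fillvalue=None))
print(pairs)  # [('a', 1), ('b', 2), ('c', None)]
```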

### The `range()` function

The `range()` function is very useful for creating arrays of dummy data, etc.

For example, for the numbers 0 to 9:

In [None]:
for i in range(10):
    print(i)

When passing only one value n, you get the values 0 to n, not including n. 

The function actually takes the form `range([start,] end [,step])` so that you can say the following also, for example:

In [None]:
for j in range(10,20):
    print(j)

Here you are passing the start and end values. The range includes the start value, but does not include the end value.

You can also specify a step, for example:

In [None]:
for k in range(10,20,2):
    print(k)
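
Two further details are often useful. Wrapping `range()` in `list()` gives you the values as a list, and a negative step counts downwards. A quick sketch:

```python
# Convert a range to a list to see (or store) all of its values
evens = list(range(10, 20, 2))
print(evens)  # [10, 12, 14, 16, 18]

# A negative step counts down; the end value is still excluded
countdown = list(range(5, 0, -1))
print(countdown)  # [5, 4, 3, 2, 1]
```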

### The `any()` and `all()` functions

The `any()` and `all()` functions are very useful for checking the contents of lists.

In [None]:
a = [False, True, True, True]
any(a)

This asks if any of the items in the list are true.

In [None]:
all(a)

While this checks to see if all the items in the list are true.
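
In practice, you will usually combine `any()` and `all()` with a condition rather than a ready-made list of booleans. The sketch below uses made-up scores and a generator expression, which is a compact way of writing a loop inside the function call:

```python
scores = [72, 85, 91, 64]

# Did every student reach the pass mark of 50?
all_passed = all(score >= 50 for score in scores)

# Did at least one student score 90 or more?
any_top_grade = any(score >= 90 for score in scores)

print(all_passed)     # True
print(any_top_grade)  # True
```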

## Packages

Python has a lot of functionality built in, however there are of course third-party packages available. 

Python packages are available via the *Python Package Index*, known simply as *PyPI*.

We can take a look at the PyPI now: <https://pypi.org>

There are hundreds of thousands of packages available for Python. 

If you wish to install a package in Python, you can install it using the `pip install` command, so for example the command:

```bash
$ pip install numpy
```

will install the package Numpy. This command should be run from the command line.

However, before you install anything, you should be aware of virtual environments.

## Virtual Environments

It is highly recommended that you create virtual environments for each of your projects. 

A virtual environment is basically an isolated Python environment. 

When you run the `python` command from the command line, you are running the system-wide Python installation. It is recommended that you avoid installing packages into the system Python. 

Instead you should use virtual environments for each project that you are working on. 

Why do this? 

Some projects will require many different packages, and some packages depend on certain versions of other packages. If, for example, one project requires version 1.5 of NumPy, but another project requires at most version 1.4, then installing both into the same environment creates a dependency conflict, and these can be difficult to solve. Therefore, it is much easier to manage smaller virtual environments for each project. 

Create a virtual environment using Python's built-in `venv` tool. You want to do this in the root directory of your current project. 

```bash
$ python -m venv my-new-virtual-env
```

This creates a new virtual environment called `my-new-virtual-env` in the current directory, under the directory `./my-new-virtual-env`.

Within this `./my-new-virtual-env` there is a new, fresh Python environment. All you need to do now is declare this virtual environment as active:

```bash
$ source ./my-new-virtual-env/bin/activate
```

or on Windows:

```cmd
my-new-virtual-env\Scripts\activate
```

and your virtual environment will now be active. Anything you do within this environment will not affect the system version of Python, nor will it affect any other virtual environment.

You will know you are in a virtual environment, as your command line shell will change its appearance, and look something like this:

```bash
(my-new-virtual-env) $
```

where the name of the current virtual environment appears before the prompt.
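
You can also check from within Python itself whether a virtual environment is active, which is useful inside a notebook where you cannot see the prompt. The sketch below relies on `sys.prefix` and `sys.base_prefix` from the standard library, and applies to environments created with `venv`:

```python
import sys

# Inside a virtual environment created with venv, sys.prefix points at the
# environment directory, while sys.base_prefix points at the Python
# installation the environment was created from.
in_virtual_env = sys.prefix != sys.base_prefix

print(f"Python executable: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtual_env}")
```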

We will now demo the creation of a virtual environment.

## Printing

You will use the `print()` function a lot in Python. 

There are several ways to print using Python, which is slightly annoying as you will see various examples of each whenever you are looking for code. 

The cleanest and most modern way to print is using f-strings (formatted strings), and we will use those exclusively in these seminars.

You define an f-string using `f` before the string's inverted commas:

```python
f"This is a formatted string."
```

You will see this most often within `print()` statements.

With f-strings, you can place variables within your strings using curly brackets:

In [None]:
greeting = 'ATSP 2025 Students'

print(f"Welcome, {greeting}!")

Code within the curly brackets is actually evaluated inline:

In [None]:
x = 3
y = 6
print(f"x: {x}, y: {y}, sum: {x + y}")

One thing you will do a lot is format very long floats in to something more readable. 

Most algorithms return probabilities that are very long, such as 0.78667479273278, which has far too much precision for human reading:

In [None]:
probability = 0.78667479273278472914

print(f"Class probability: {probability}")

Using a `:` you can define how a variable is printed:

In [None]:
print(f"Class probability: {probability:.3f}")

Here `.3` specifies that you want 3 digits after the decimal point, and `f` tells Python to format the value as a fixed-point float, rounding as necessary.

Large numbers can also be printed using thousands separators:

In [None]:
large_n = 7630726492344.8934

print(f"Samples: {large_n:,.2f}")

Often you will work with percentages:

In [None]:
percent = 0.256
print(f"Percentage: {percent:.2%}")

Alignment can be done by specifying a minimum field width before the `.`, so that shorter numbers are padded with spaces on the left:

In [None]:
class_ids = [423, 23, 9, 810, 10]

for class_id in class_ids:
    print(f"Class ID: {class_id:3.0f}")
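
The same idea works for strings, and you can control the direction of the alignment with `<` (left) and `>` (right). The sketch below lines up a small table of results; the labels and scores are made up for illustration:

```python
rows = [("cat", 0.913), ("dog", 0.5), ("horse", 0.07)]

lines = []
for label, score in rows:
    # <8 left-aligns the label in a field of 8 characters,
    # >7.2% right-aligns the score as a percentage in a field of 7
    lines.append(f"{label:<8}|{score:>7.2%}")

print("\n".join(lines))
# cat     | 91.30%
# dog     | 50.00%
# horse   |  7.00%
```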

Just to give you an example of how strings can also be printed in Python:

In [None]:
temp = 13
day = 'Thursday'

# Old way
print("On %s it will be %s°C" % (day, temp))

# New way
print(f"On {day} it will be {temp}°C")

The old way gets very difficult if you have 10 variables you want to print. Then you have to literally count the occurrences of `%s` to find the variable you want to change and also ensure that they are in the right order in the tuple at the end of the string.

### String's `.center()` function

This is very useful for nicely printing results to the console, or for printing logs and so on during debugging:

In [None]:
accuracy = 0.913
f1_score = 0.876

print(f" Summary ".center(40, '-'))

print(f"Accuracy: {accuracy}")
print(f"f1 Score: {f1_score}")

print(f" End Summary ".center(40, '-'))

## Scientific Notation

You will see scientific notation a lot when using Python for Machine Learning, especially when you are looking at the outputs of models that return probabilities and so on.

To print a number in scientific notation, use `:e`, as follows:

In [None]:
num = 0.100009732

print(f"{num:e}")

If you want to go from scientific notation back to a float:

In [None]:
num = 8.2885e-01

print(f"{num:.8f}")

## Organising Your Code

In Python, a Python file is a module, meaning it can be imported in to the current namespace using `import`. 

For example, we have the following file, `mathtools.py` in our current directory:

```python
# mathtools.py

def square(n):
    return n*n
```

You import this in to a Python session using `import`:

In [None]:
import mathtools

In [None]:
mathtools.square(2)

In [None]:
from mathtools import square

In [None]:
square(5)

This can be very useful if you have a bunch of frequently used functions you have collected over the years and just wish to use them within a notebook. 

## Summary

We have seen:

- Variables
- Functions
- Lists, Sets, Dictionaries, Tuples
- The `for` loop
- `if`/`else` statements
- Virtual environments
- How to use `print()`
- How to organise code in to modules
- How to install Python packages

Any questions so far?

**The most important things to know are data structures such as lists, how to define basic functions, how to loop over lists and dictionaries, how to control flow using `if` and `else`**

---

# Numpy

What is Numpy? You can think of it as an extension for Python which adds support for large matrix and array manipulation. It also provides a large number of functions to work with these arrays, which we will see shortly. If you know MATLAB, then Numpy will seem very familiar to you. As opposed to Pandas, which deals with Excel style data, tabular data and so on, Numpy deals primarily with numerical data.

Numpy is by far the most used package in Python in the field of Machine Learning and Data Science. Knowing how it works is essential for any work in this domain. 

Therefore we will dedicate an entire session to Numpy so that you are familiar with it, as it will be used a lot. 

Pandas is also often used, and we will also cover Pandas later today.

## Arrays and Matrices

So what are arrays and matrices?

An array is just a list of numbers. You might also know it as a vector. 

For example:

```python
a = [1, 2, 3, 5, ..., n]
```

A Numpy array is at first glance very similar to a standard Python list, however it has far more functionality and built in features.

Numpy arrays can be 1-dimensional, like a Python list, or they can be 2D, 3D, or even $n$-dimensional.

The simplest Numpy array is 1-dimensional, and is more or less equivalent to a standard Python list, at least at first glance. However, it offers much more functionality than a standard Python list.

A 2D Numpy array is like a table (you can think of it as a list of 1D arrays), and is synonymous with a matrix in mathematics.

A 3D array has a 3rd depth dimension. You can kind of think of it as a list of 2D arrays, and it can be compared to a tensor in mathematics.

![numpy](./img/numpy-matrices-crop.png)

*Source*: <https://jovian.com/anujadp4/python-numerical-computing-with-numpy>

Numpy is n-dimensional, so you can define even further dimensions, for example a 4D array is a list of 3D arrays, and so on. Numpy does cap the number of dimensions (32 in NumPy 1.x, raised to 64 in NumPy 2.x), however, in practical applications, it's rare to use arrays with more than 3 or 4 dimensions.

A lot of data in your machine learning tasks are going to be either supplied to you as Numpy data, or you will manipulate this data using Numpy, or the software or algorithms will require input as Numpy data. Therefore, it is an essential skill.

Not only this, but Numpy is extremely fast and efficient compared to the built-in Python lists.

Let's start with importing Numpy:

In [None]:
# It is convention to import Numpy as np
import numpy as np

Let's create a 1D array:

In [None]:
a = np.array([1,2,3,4,5,6,8,9,10])

In [None]:
a

Let's look at its type:

In [None]:
type(a)

You can see it is a Numpy *n*-dimensional array (`ndarray`). In our case, it is a 1D array.

You can verify this using the `ndim` property:

In [None]:
a.ndim
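
To make the difference between 1D, 2D, and 3D arrays more concrete, the sketch below creates one of each from nested lists and prints the `ndim` and `shape` properties. The variable names and values are chosen purely for illustration:

```python
import numpy as np

vector = np.array([1, 2, 3])                # 1D
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2D: 2 rows, 3 columns
stack = np.array([[[1, 2], [3, 4]],
                  [[5, 6], [7, 8]]])        # 3D: 2 matrices of 2 x 2

for name, array in [("vector", vector), ("matrix", matrix), ("stack", stack)]:
    print(f"{name}: ndim={array.ndim}, shape={array.shape}")
# vector: ndim=1, shape=(3,)
# matrix: ndim=2, shape=(2, 3)
# stack: ndim=3, shape=(2, 2, 2)
```

Each extra level of nesting in the list adds one dimension to the resulting array.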

Arrays use zero-based indexing, just like normal Python lists. However, Numpy indexing is much more capable, as we will see later.

Simple index:

In [None]:
a[0]

The last element can be accessed using `-1`:

In [None]:
a[-1]

Or the 3rd from last element:

In [None]:
a[-3]

So far we have seen nothing that you cannot do with normal lists.

However, if you inspect the list of available functions, you will soon see that Numpy arrays are much more powerful than standard Python lists. 

We can browse the available functions using tab completion (type `a.` and press Tab), and view the documentation of any one of them by appending `?`:

In [None]:
a.any?

Or indeed, we can view the documentation as we have done previously. Documentation is not just for Python built in functions, but also for libraries that you have imported.

In [None]:
np.array?

## Array Broadcasting

Array broadcasting allows you to perform operations over arrays without needing to loop over them.

A simple example illustrates this. We declare a few standard Python lists as follows:

In [None]:
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

In [None]:
a

In [None]:
b = [10, 20, 30]

Now let's say I want to add `b` to each row of `a`. 

Let's try to use addition:

In [None]:
a + b

That is not what we wanted at all. Python has concatenated the two lists, and we have ended up with a list of 6 elements, where the first 3 elements are lists and the last 3 are the individual items from `b`.

So how would we do what we wanted?

We could simply use a `for` loop:

In [None]:
new_matrix = []

for row in a:
    new_row = []
    for element, addition in zip(row, b):
        new_row.append(element + addition)
    new_matrix.append(new_row)

In [None]:
new_matrix

However, we had to write an entire loop, which is error prone and takes a lot of time, etc. 

In Numpy this is much easier. Let's first create Numpy arrays instead of lists:

In [None]:
a = np.array(a)
b = np.array(b)

print(a)
print()
print(b)

And then add them:

In [None]:
a + b
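
Broadcasting works by stretching the smaller array so that its shape matches the larger one. The sketch below shows three common cases: a single number, an array added to every row, and the same values reshaped so they are added to every column instead. The names `row_offsets` and `column_offsets` are made up for this example:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# A single number is broadcast to every element
print(a + 100)

# A shape (3,) array is added to each row
row_offsets = np.array([10, 20, 30])
added_to_rows = a + row_offsets
print(added_to_rows)
# [[11 22 33]
#  [14 25 36]
#  [17 28 39]]

# Reshaped to (3, 1), the same values are added to each column instead
column_offsets = row_offsets.reshape(3, 1)
added_to_columns = a + column_offsets
print(added_to_columns)
# [[11 12 13]
#  [24 25 26]
#  [37 38 39]]
```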

Let's now say we had some values in a list, and you wanted to multiply each value with a set of weights. 

Let's declare these as Python lists:

In [None]:
a = [10, 20, 30]

In [None]:
weights = [1.1, 2.2, 3.3]

In [None]:
a * weights

In Python we get a `TypeError`, as two lists cannot be multiplied together.

Again, we will have to write a loop:

In [None]:
weight_sums = []

for i in range(3):
    t = a[i] * weights[i]
    weight_sums.append(t)    

Which produces:

In [None]:
weight_sums

This is what we wanted, but again it required writing this loop, etc.

To do this in Numpy, we can convert the lists to Numpy arrays, and then use the `*` operator on the Numpy arrays:

In [None]:
a = np.array([10, 20, 30])
weights = np.array([1.1, 2.2, 3.3])

a * weights 

It is also worth mentioning that array broadcasting can be more than 100x faster than using `for` loops in Python, as the underlying Numpy code is written in C, operates on whole arrays at once, and is heavily optimised.

If you are writing any code that uses loops in Python, and find that it is taking a very long time, see if you can perform the same thing using Numpy's array operations.
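
You can get a rough feel for the difference yourself using the standard library's `timeit` module. The sketch below doubles 100,000 numbers with a Python loop and with Numpy. The exact timings depend entirely on your machine, so treat the output as an indication only:

```python
import timeit

import numpy as np

values = list(range(100_000))
array = np.arange(100_000)

def double_with_loop():
    # Plain Python: visit every element one at a time
    result = []
    for value in values:
        result.append(value * 2)
    return result

def double_with_numpy():
    # Numpy: one operation over the whole array
    return array * 2

# Run each version 10 times and report the total time taken
loop_time = timeit.timeit(double_with_loop, number=10)
numpy_time = timeit.timeit(double_with_numpy, number=10)

print(f"Loop:  {loop_time:.4f} s")
print(f"Numpy: {numpy_time:.4f} s")
```

Both functions produce the same values, only the time taken differs.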

### Matrix Multiplications

We will not use matrix multiplication directly in this seminar, however we will briefly mention it, as it is something you will come across eventually in Machine Learning.

Matrix multiplications are of the form:

![Matrix](./img/matmul.png)

*Image source*: James, Witten, Hastie, Tibshirani. **An Introduction to Statistical Learning**.

For matrix multiplication, you need to use the `matmul()` function, as the `*` operator multiplies arrays element by element. 

Let's declare two arrays and take a look at them:

In [None]:
A = np.array([[1,2], [3,4]])
B = np.array([[5,6], [7,8]])

print("A =\n", A)
print("B =\n", B)

Using `matmul()` we can perform this operation quickly:

In [None]:
np.matmul(A, B)

You can also use the `@` symbol for this.

In [None]:
A @ B
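
It is easy to mix up `*` and `@`, so here is a short side-by-side sketch using the same two arrays. The `*` operator multiplies matching elements, whereas `@` performs a true matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise: each element of A is multiplied by the matching element of B
elementwise = A * B
print(elementwise)
# [[ 5 12]
#  [21 32]]

# Matrix product: rows of A combined with columns of B
matrix_product = A @ B
print(matrix_product)
# [[19 22]
#  [43 50]]
```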

### Array comparison

You can also compare arrays quickly, which can come in very handy when you are comparing the output of a trained model with the ground truth labels, for example.

In [None]:
a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[2, 2, 3], [1, 2, 5]])

In [None]:
a

In [None]:
b

In [None]:
a == b

And the same is true for `>`, `<`, and `!=`, for example:

In [None]:
a > b
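
As a sketch of how this is used in practice, the example below compares some model predictions with the true labels and computes the accuracy. The two label arrays are invented for illustration and do not come from a real model:

```python
import numpy as np

# Hypothetical labels, made up for this example
predictions = np.array([1, 0, 1, 1, 0, 1])
ground_truth = np.array([1, 0, 0, 1, 0, 0])

# Element-wise comparison gives an array of True/False values
correct = predictions == ground_truth
print(correct)  # [ True  True False  True  True False]

# True counts as 1 and False as 0, so we can sum and average directly
n_correct = np.sum(correct)
accuracy = np.mean(correct)

print(f"Correct: {n_correct} of {len(correct)}")  # Correct: 4 of 6
print(f"Accuracy: {accuracy:.2%}")                # Accuracy: 66.67%
```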

## Array Slicing

One of NumPy's biggest strengths is the ability to slice arrays in to subsets very easily.

Remember that slices **include the start index** and **exclude the stop index**. 

Let's start with some simple slicing of a 1D array:

In [None]:
a = np.arange(1, 11)
a

In [None]:
a[0]

In [None]:
a[-1]

We have seen this already, but you can also provide a range of items you wish to retrieve:

In [None]:
a[0:3]

As is Python convention, `0:3` **includes** element 0 and **excludes** element 3.

You can omit the digit before or after the `:` to mean 'from the start' or 'to the end'.

In [None]:
a[:5]

Ranges can also use negative numbers, which we saw earlier for selecting the last element. 

Here we go from the 5th last element up to, but not including, the last element:

In [None]:
a[-5:-1]

You can also select items in steps. You place the step size after the range, in the form `start:stop:step`. 

Let's make a larger array for this, and select only every 10th element:

In [None]:
a = np.arange(100)

In [None]:
a

In [None]:
a[0:100:10]

Or, using no values for start and end, we can select every 2nd element:

In [None]:
a[::2]

### 2D Array Slicing

So far so good with 1D arrays. However, we can also do more advanced slicing on 2D arrays.

Here are a few examples we will cover now:

![Slicing](./img/array-slicing.jpg)

Let's first replicate the dataset from the image above:

In [None]:
a = np.arange(36).reshape(6,6)
a

We use the function `reshape()` to create a 2D matrix out of the 1D array that `arange()` returns.

So, with 2D arrays you specify each slice for each dimension using `,`.

The first dimension specifies the rows you want to select, the second dimension specifies the columns you want to select.

So, we can select the first row like this:

In [None]:
a[0,:]

Note that we used `:` to specify the entire range, as we have seen before. 

So we said, we wanted row 0, all columns.

If we wanted only the first **column**, then we can use the following:

In [None]:
a[:, 0]

In this case, we use `:` to say all rows, and `0` to say the first column.

We can also use negative indexing.  

For example we want only the **last two rows**, and only the **first column**:

In [None]:
a[-2:, 0]

Take a look at the image above to confirm what we have selected.

Ranges can be used for both axes/dimensions. So let's say we wanted only the centre rows and columns, we can use ranges for both axes:

In [None]:
a[2:4, 1:5]
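
One thing to be aware of is that a slice of a Numpy array is a *view* onto the original data rather than a copy, so changing the slice also changes the original array. If you need an independent subset, call `.copy()`. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np

a = np.arange(36).reshape(6, 6)

centre = a[2:4, 1:5]               # A view onto rows 2-3, columns 1-4
print(centre.shape)                # (2, 4)

independent = a[2:4, 1:5].copy()   # An independent copy of the same region
independent[0, 0] = -1             # Changes only the copy
print(a[2, 1])                     # 13

centre[0, 0] = -1                  # Changes the original array too
print(a[2, 1])                     # -1
```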

### 3D Slicing

3D slicing is also very common. For example, stacks of images, which are fed in to neural networks, are passed as 3D arrays. Therefore, it is common to index them directly in the stacks in which they are stored using Numpy.

![numpy](./img/numpy-matrices-crop.png)

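In a 2D array, axis 0 runs along the rows and axis 1 along the columns. When 2D arrays are stacked into a 3D array, the new depth dimension becomes axis 0, the rows become axis 1, and the columns become axis 2.

So, when indexing 3D arrays, the **order of the axes** is depth first, then rows, then columns.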

You can also think of it like this:

```python
array[depth_slice, row_slice, column_slice]
```

This will become much more clear as we step through some examples.

Note that this ordering follows from how Numpy stores data by default (row-major order): each 2D array in the stack occupies one continuous block of the computer's memory, which makes it fast to access.

First let's make a data structure:

In [None]:
a = np.array([
    np.arange(36).reshape(6, 6),
    np.arange(36, 36+36).reshape(6, 6),
    np.arange(72, 72+36).reshape(6, 6),
])
a

We can confirm the order of the axes by looking at `a`'s shape, which lists the depth first, then the rows, then the columns:

In [None]:
np.shape(a)

Say we want the deepest of the arrays, we would use the first index:

In [None]:
a[-1, :, :]

Likewise, we can select the first row of each 2D array in the stack:

In [None]:
a[:,0,:]

Or the first column of each 2D array:

In [None]:
a[:, :, 0]
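
A good habit when slicing 3D arrays is to check the shape of the result, as it tells you which dimensions you kept. The sketch below rebuilds the same values with a single `reshape()` call and prints the shape of each slice:

```python
import numpy as np

# Same values as the stack above: 3 arrays of 6 rows x 6 columns
a = np.arange(108).reshape(3, 6, 6)

print(a[-1, :, :].shape)  # (6, 6): the last 2D array in the stack
print(a[:, 0, :].shape)   # (3, 6): the first row of each 2D array
print(a[:, :, 0].shape)   # (3, 6): the first column of each 2D array

# A single element: 2D array 1, row 2, column 3
print(a[1, 2, 3])         # 51
```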

As mentioned before, we will not be using 3D arrays in this seminar, however, if you work with neural networks with image data, you will come across 3D arrays frequently. 

## Searching Numpy Arrays

You will often want to search for particular elements based on a condition in Numpy, and this can be done using conditional operators within `[]` in much the same way as we have used index slicing.

For example, consider the following 2D array:

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a

If you wish to retrieve only the values greater than 5, use the following syntax within the `[]`:

In [None]:
a[a > 5]

Note that it returns an array of **values**, and loses its shape. The array `a` was a 2D array, and we got back a 1D array of values. This is often what you want, but not always.

If you prefer that the indices are returned instead of the values, use the `where` function, which returns the positions of the matching elements:

In [None]:
np.where(a>2)

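In this case we are returned a tuple of two 1D arrays, one for each dimension: the first holds the row indices and the second holds the column indices of the matching elements.

So, for example, the first match is at index 0,2, where the element is greater than 2.

We can see this visually, or check it directly: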

In [None]:
a[0,2]
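
The index arrays returned by `where` can be paired up into coordinates or passed straight back into `[]`. The sketch below uses the same values as the array above, under the illustrative name `grid`. It also shows the three-argument form of `np.where`, which keeps the original shape by substituting a fill value (here `0`, chosen arbitrarily) wherever the condition is false:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# One array of row indices and one array of column indices
rows, cols = np.where(grid > 5)

# Pair them up to get (row, column) coordinates
coords = list(zip(rows.tolist(), cols.tolist()))
print(coords)  # [(1, 2), (2, 0), (2, 1), (2, 2)]

# Passing the index arrays back in recovers the matching values
values = grid[rows, cols]
print(values)  # [6 7 8 9]

# Three-argument form: keep matching values, replace the rest with 0
masked = np.where(grid > 5, grid, 0)
print(masked.shape)  # (3, 3)
```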

Numpy allows you to perform sophisticated searching, but this is beyond the scope of the seminar. It suffices to know that conditional operators can be used in much the same way as slicing is performed.

## Aggregate Functions and Basic Statistics

Let's define a 2D array:

In [None]:
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr

We can sum all the elements using `sum()`:

In [None]:
np.sum(arr)

We can sum down the rows, giving one total per column, by specifying the axis as `0`:

In [None]:
np.sum(arr, axis=0)

This might seem counter-intuitive, but you are telling Numpy to sum **in the direction of axis 0**, which is down the rows, top to bottom. Axis 0 is collapsed, so you get one total per column.

To sum across the columns, giving one total per row:

In [None]:
np.sum(arr, axis=1)

Again, when we specify the axis as 1, we are saying to sum in the **direction of axis 1**, which is left to right across the columns. Axis 1 is collapsed, so you get one total per row.
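
Checking the shapes of the results is a quick way to see which axis has been collapsed. This is a small sketch using the same values as `arr`, under the illustrative name `table`:

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3): 2 rows, 3 columns

column_totals = np.sum(table, axis=0)  # collapse the rows: one total per column
row_totals = np.sum(table, axis=1)     # collapse the columns: one total per row

print(column_totals, column_totals.shape)  # [5 7 9] (3,)
print(row_totals, row_totals.shape)        # [ 6 15] (2,)

# keepdims=True keeps the collapsed axis with length 1
row_totals_2d = np.sum(table, axis=1, keepdims=True)
print(row_totals_2d.shape)  # (2, 1)
```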

We can calculate the mean as follows:

In [None]:
np.mean(arr)

Or the standard deviation:

In [None]:
np.std(arr)

We can also easily get the maximum and minimum values of the array:

In [None]:
print(np.min(arr)) 
print(np.max(arr)) 

Finally, we can print the cumulative sum of the 2D array. Note that without an `axis` argument, `cumsum` flattens the array first and returns a 1D result:

In [None]:
np.cumsum(arr)
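
The `axis` argument works for `cumsum` and the statistics functions in the same way as it does for `sum`. A brief sketch, again using the same values as `arr` under the illustrative name `table`:

```python
import numpy as np

table = np.array([[1, 2, 3], [4, 5, 6]])

flat_running = np.cumsum(table)       # flattened first, then accumulated
down_rows = np.cumsum(table, axis=0)  # running total down each column
across = np.cumsum(table, axis=1)     # running total across each row
column_means = np.mean(table, axis=0) # mean of each column

print(flat_running)  # [ 1  3  6 10 15 21]
print(down_rows)     # [[1 2 3], [5 7 9]]
print(across)        # [[1 3 6], [4 9 15]]
print(column_means)  # [2.5 3.5 4.5]
```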

## Summary

We have covered the following:

- Creating 1-dimensional arrays
- Slicing and indexing 1D arrays
- Array broadcasting: perform operations such as `+` or `*` on all elements of an array, optimised for speed 
- Creating 2-dimensional arrays, and slicing them, and the syntax involved
- Creating 3-dimensional arrays, and how to slice and index them
- Searching through arrays based on criteria, e.g. `>5`
- Looked at some built-in aggregate functions and basic statistics functions

Are there any questions? 

---

# Pandas

As mentioned, I am aware that Pandas was covered already, but if a refresher is required we can cover it now before the afternoon session.

If you think of Numpy as an array and matrix manipulation library, then Pandas, on the other hand, provides the functionality of R's Data Frames. Or you could think of it as Excel for Python. It allows you to manipulate data stored in Excel-like spreadsheets or R-like Data Frames in a programmatic way. Pandas is best suited for **tabular** data that contains not only numerical data, but also text or categorical data.

Unlike Numpy arrays, Pandas DataFrames have column names, and you can access individual columns using the column name, while in Numpy you access columns using indices.

Pandas can also read Excel files directly, and it provides convenient functions for importing spreadsheet data, such as `read_excel()`. It can also read SAS and SPSS files and query SQL databases.

We will now demonstrate the use of Pandas, by opening an Excel file that contains multiple sheets (we can take a look at it in Excel first). The Excel file relates to a catalog of movies, organised by genre, year, and so on.

In [None]:
import pandas as pd  # Convention 

# See https://www.dataquest.io/blog/excel-and-pandas/
movies = pd.read_excel('./data/movies.xls')

In [None]:
movies

If we were to look at the Excel file, we would see it has multiple sheets. 

By default, Pandas will open the first sheet, sheet 0 (sheets are 0-indexed). You can specify the sheet you wish to import using the `sheet_name` parameter, as follows:

In [None]:
movies_2000s = pd.read_excel('./data/movies.xls', sheet_name=1)

If you know the name of the sheet, this is equivalent:

In [None]:
movies_2000s = pd.read_excel('./data/movies.xls', sheet_name='2000s')

In [None]:
movies_2000s.head()

The first column of the output contains an automatically created index. If we wanted, we could use the film title as the index instead:

In [None]:
movies_2000s = pd.read_excel('./data/movies.xls', sheet_name=1, index_col=0)
movies_2000s.head()

As you can see, it is very easy to load Excel data using Pandas. Numpy would not be suitable for this task, as most of the data in this movies spreadsheet is not numerical.

### Iris

Let's move to another dataset, namely the Iris dataset.

The Iris dataset consists of 150 observations of 3 species of iris plant, namely setosa, virginica, and versicolor (50 of each).

In [None]:
iris_df = pd.read_csv('data/iris.csv', sep=',', header=None)

As this is a CSV file, we have specified that the values are separated by the comma `,` symbol, and that the file's first line does not contain a header.

We can preview the data within Jupyter to confirm this.

Now that the data has been loaded in to a Pandas DataFrame, we can view it:

In [None]:
iris_df

The `head()` and `tail()` functions are useful for previewing your data:

In [None]:
iris_df.head()

Pandas DataFrames allow you to add column names to the dataset (this is also not possible with Numpy arrays):

In [None]:
iris_df.columns = ["sepal_length", 'sepal_width', 'petal_length', 'petal_width', 'class']

In [None]:
iris_df
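
Column names can also be supplied at load time through the `names` parameter of `read_csv`. The sketch below is self-contained: instead of the course's `iris.csv` it reads three illustrative rows from an in-memory string, and the exact class labels are an assumption about how the file is written:

```python
from io import StringIO

import pandas as pd

# Three illustrative rows in a header-less, comma-separated layout
csv_text = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

# header=None: the first line is data, not column names
# names=...: use these column names instead of 0, 1, 2, ...
sample_df = pd.read_csv(StringIO(csv_text), header=None, names=column_names)
print(sample_df)
```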

You can use the `describe()` function to get a quick overview of some of the statistics of the data, such as the mean value per column:

In [None]:
iris_df.describe()

Note that the `class` column is not included as it does not contain numerical data.
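
Non-numerical columns can still be summarised. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are purely illustrative), using `value_counts()` and `describe(include='all')`:

```python
import pandas as pd

# Small made-up DataFrame with one numerical and one text column
toy_df = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0, 1.3],
    'class': ['setosa', 'versicolor', 'virginica', 'setosa'],
})

numeric_summary = toy_df.describe()           # numerical columns only
class_counts = toy_df['class'].value_counts() # number of rows per class
full_summary = toy_df.describe(include='all') # all columns, with NaN where a statistic does not apply

print(class_counts)
print(full_summary)
```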

The `info()` function provides further information:

In [None]:
iris_df.info()

This shows you information about the data types, whether there is any missing information, and so on.

### Slicing Data

You can select rows and columns based on the names of the columns, in contrast to Numpy where only numerical indices are allowed. For example, selecting the `sepal_length` column returns a Pandas `Series`, which we can confirm using `type()`:

In [None]:
type(iris_df['sepal_length'])

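If your column name does not contain any spaces or special characters, you can use the dot (`.`) notation to select columns. This does not work for names that clash with Python keywords or existing DataFrame methods, so the `class` column, for example, must be selected with `[]`: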

In [None]:
iris_df.sepal_width

You can also select multiple columns by passing a list:

In [None]:
iris_df[['petal_length', 'class']]

You can index rows numerically using the `iloc` attribute.

Here we select the first row of the dataset:

In [None]:
iris_df.iloc[0]

Indices are 0-based.

You can also select multiple rows in this manner, by passing a list of rows you wish to retrieve:

In [None]:
iris_df.iloc[[1,2,3]]

Or indeed a range. Here we are selecting rows 0 to 9 (the end of the range is excluded):

In [None]:
iris_df.iloc[0:10]

And combine this with a list of column names, as follows:

In [None]:
iris_df.iloc[0:10][['petal_length', 'class']]
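
Rows and columns can also be selected in a single step. The sketch below uses a small made-up DataFrame (`toy_df` is illustrative) and assumes the default 0, 1, 2, ... index. Note that `iloc` works with positions and excludes the end of a range, whereas `loc` works with labels and includes it:

```python
import pandas as pd

# Small made-up DataFrame with the default integer index
toy_df = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.3],
    'petal_length': [1.4, 1.4, 4.7, 6.0],
    'class': ['setosa', 'setosa', 'versicolor', 'virginica'],
})

# iloc: positions only, end of range excluded (rows 0, 1, 2; columns 1 and 2)
by_position = toy_df.iloc[0:3, [1, 2]]

# loc: labels, end of range included (rows labelled 0, 1, 2)
by_label = toy_df.loc[0:2, ['petal_length', 'class']]

print(by_position.equals(by_label))  # True
```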

For non-contiguous rows, you can use Numpy's `np.r_` helper to specify ranges to pass to `iloc`:

In [None]:
iris_df.iloc[np.r_[2:5, 7:10]]
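
To see what `np.r_` actually produces, you can evaluate it on its own. It concatenates the ranges into a single array of row positions:

```python
import numpy as np

# Two ranges joined into one array of positions
row_positions = np.r_[2:5, 7:10]
print(row_positions)  # [2 3 4 7 8 9]
```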

A common task is to sample from your dataset based on conditions. For example, you may want all rows where the sepal length is greater than 7cm. 

If you run the following:

In [None]:
condition = iris_df.sepal_length > 7

you will get back a Pandas `Series` of True/False values (a Boolean mask), one per row:

In [None]:
condition

And this can be passed as a condition to your DataFrame:

In [None]:
iris_df[condition]

If you use `loc` you can specify the condition, and a list of columns that you want:

In [None]:
iris_df.loc[condition, ['sepal_length', 'class']]

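You can even write more complex statements that combine multiple conditions.

For example, you might want only those rows where the sepal length is greater than 5cm and the petal length is greater than 6cm. Each condition must be wrapped in parentheses, and the conditions are combined with `&`: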

In [None]:
condition_multiple = (iris_df['sepal_length'] > 5) & (iris_df['petal_length'] > 6)

In [None]:
condition_multiple

The `condition_multiple` variable does not contain the data. It contains a mask that can be passed as an index to the DataFrame:

In [None]:
iris_df[condition_multiple]

Combine this with `loc` once again, and you can specify a subset of columns that you want returned:

In [None]:
iris_df.loc[condition_multiple, ['sepal_length', 'petal_length', 'class']]
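
Masks can be combined in other ways too: `|` means "or" and `~` means "not". The following sketch uses a small made-up DataFrame (`toy_df` and its values are illustrative) and also shows the `query()` method as an alternative way of writing the same filter:

```python
import pandas as pd

# Small made-up DataFrame for illustration
toy_df = pd.DataFrame({
    'sepal_length': [4.9, 5.8, 7.7, 6.3],
    'petal_length': [1.4, 4.0, 6.7, 6.1],
    'class': ['setosa', 'versicolor', 'virginica', 'virginica'],
})

long_sepal = toy_df['sepal_length'] > 5
long_petal = toy_df['petal_length'] > 6

both = toy_df[long_sepal & long_petal]    # and: both conditions hold
either = toy_df[long_sepal | long_petal]  # or: at least one condition holds
short_petal = toy_df[~long_petal]         # not: the condition does not hold

# The same "and" filter written as a query string
queried = toy_df.query('sepal_length > 5 and petal_length > 6')

print(both.index.tolist())         # [2, 3]
print(either.index.tolist())       # [1, 2, 3]
print(short_petal.index.tolist())  # [0, 1]
```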

Hopefully this gives you an overview of how to use Pandas. We will be using Pandas throughout the course, and these basics should be enough for you to understand any of the Pandas code you will see later.

--- 

# Links and Resources

For a good guide on how to use Python's built-in `help()` function, see <https://www.pythonmorsels.com/help-features/>

# End of Morning Session

We will continue after a break.