# Day 1: Morning Session

# Introduction

Welcome to the Python section of the Advanced Topics in Scientific Programming (ATSP) 2024 course!

We have two days of Python, where we will discuss the following:

- Day 1 will focus on the basics of Python, NumPy, and Jupyter, and then on machine learning methods using scikit-learn
    - As I understand it, you have already covered Pandas in the introduction course?
    - If we need a refresher on Pandas, we can do so quickly in this morning session.
- Day 2 will focus on Deep Learning, model deployment, and your assignment

Your assignment will be to complete an end-to-end machine learning project, and you will have a choice of several assignments. We will go into more detail regarding the assignment tomorrow. One of the assignments will be an image classification task using deep neural networks. Another will relate to a text or tabular dataset.

## Admin and Materials
Let's first discuss some administrative matters and the material for the course.

We provide the materials as follows: 

- All course materials are provided and presented as Jupyter notebooks
- Jupyter is a web-based development environment for Python, R, etc.
- During the seminars, we will use this server to run and execute the code live
- Each of you has an account, so you can log in to the server, access the materials, follow along while we code, and experiment with the code yourselves
- The server will only run during the seminar times. Outside of the seminars, you will be able to access the notebooks via our GitHub repository

This means that to run the notebooks outside of the seminar times, you will need to install and run Jupyter locally, or use something like Google Colab, which we will discuss in detail tomorrow. Tomorrow afternoon we will discuss your assignment, and in that session we can also set up Jupyter on your local machines.

### Day 1

Today we will discuss the following topics:

- What is Jupyter and how to use it
- Basics of Python
- Basics of NumPy
- Basics of Pandas, if required?
- Machine Learning basics
    - Classification, regression / Supervised vs. Unsupervised
    - Overfitting and underfitting
    - Model evaluation
- scikit-learn
    - How to get data
    - How to generate data
    - Training, testing
    - Cross validation
    - Model persistence
    - Scaling and normalisation
    - Data preprocessing

We will split this content into the morning and afternoon sessions. 

So, in a bit more detail: first we discuss Jupyter and how to use it. We will use Jupyter almost exclusively in this course, both to cover the material and to run code examples, and you can follow along and try the examples yourselves. Jupyter is a good skill to have if you work in data science, machine learning, or academic research in general. For example, it is very useful for reproducibility, for organising code, for presenting work, and for development work on a project. You can use it as an IDE and completely replace a text editor. It is web-based, meaning all you need is a browser to use it. It can also be used remotely: for example, you can code at home while accessing a server at work or anywhere else in the world. It is worth noting that Jupyter is not only for Python development: there are so-called kernels for dozens of languages, including R, Julia, MATLAB, Mathematica, SPSS, you name it!

After covering Jupyter and how to use it, we cover the basics of Python. Python is a general-purpose language, which makes it a very useful language to know. It is often used for Machine Learning tasks, and it completely dominates as the language of choice for Deep Learning. If you are doing any kind of Deep Neural Network training, you will most likely be using Python and frameworks such as PyTorch and TensorFlow. So in this section we will cover the basics of Python, such as functions, lists, classes, etc.

Once we have covered the basics of Python, then we will cover some tips and tricks for using Python in the context of Machine Learning, and cover some aspects of the language that make it useful for Data Science work in general.

Next, we cover NumPy, which is a framework for handling data in Python. It provides essential functionality for manipulating array and matrix data and is used in practically any machine learning project. If required, we can cover Pandas as well, although I understand that this has already been covered in the introduction course.

After we have covered the basics of Python programming, Jupyter, and NumPy, we will move on to Machine Learning methods using Python. The focus of this course is practical Machine Learning. We will not discuss in detail how the algorithms work; instead, we focus on how to train them, what the best practices are, where to get data, and so on. We will mostly use scikit-learn for this.

**Please note**: these notebooks are rather text-heavy, and this is by design. The reason is that it should be possible for you to download, read, and work through all these notebooks completely independently after the course has finished. Hence they should be self-explanatory and useful as a reference later, if you wish to go back to a topic (for your assignment, for example).

### Day 2

Day 2 will focus on Neural Networks and Deep Learning, as well as model deployment.

Topics include:

- Simple neural networks: we will use PyTorch to define a simple neural network and train it
- PyTorch: we will discuss the framework we will use to create neural networks
- Deep learning: we move to deep networks, the basis for all of the recent advancements, such as generative models and GPT models
- Image classification: we will train an image classifier on a number of small tasks
- Image segmentation: we discuss image segmentation in the context of medicine
- Pre-trained models: use networks that have already been trained
- Fine-tuning models: adapt a pre-trained network to your specific task 
- Model deployment and web application development
- Assignment

## Get Started

So, as mentioned, in this course we are going to be using Jupyter almost exclusively. We have set up a server so that you can follow along with the examples that we cover here.

The server can be accessed via a link that will be shown on the projector now!

Do not worry if there is some warning about insecure HTTP: access to this server is limited to the MUG network and KAGes. No one else can access this server from outside of the MUG or the KAGes network.

To log in, use s + matriculation number, e.g. `s07316801` and the password that will be shown on the whiteboard. 

When you log in, you will see the notebooks for the course. If you do not see any notebooks, then something went wrong! Most likely you have entered an incorrect matriculation number. Let's make sure everyone is logged in before we continue.

Note that this server will only be running during the seminar times! Outside of seminar times, you can download the notebooks from the GitHub repository.

Outside of the seminar times, you will need to either install Jupyter on your machines, or use one of the online services that we will cover later in the course. 

For now however, we will use this server for the next two days during the course.

The material (for example these notebooks, plus other material from the R part of the course, datasets, and so on) is available at any time under the following link:

- <https://github.com/imigraz/ATSP-2024>

Let's have a look at the repository now. You can download the contents of the repository as a zip file, or you can clone the repository using Git.

## Exercises

If we have time, there are some small exercises along the way that you can complete. However, because it is unclear whether we will have time for these, the grading will be based only on the assignment that you have to complete.

---

# Jupyter

This environment that we are using now is called Jupyter. It provides an interactive web-based environment called notebooks. 

It is worth noting that Jupyter is not just for Python development. There are so-called kernels for dozens of languages, see <https://github.com/jupyter/jupyter/wiki/Jupyter-kernels>, including R. Also, Jupyter is used widely in academia to disseminate results and experiments. It is therefore a very useful tool to aid reproducibility. A particularly good example of this can be found here: <https://www.nature.com/articles/ng.3051>. You will find that there are a number of Jupyter notebooks associated with the paper, and they can be used to reproduce the exact same plots found within the paper itself: <https://github.com/theandygross/TCGA/>. If you supply your experiments as Jupyter notebooks, anyone can download them and run them on their local machine, confirming your results. This can greatly enhance your chances of getting a paper published!

GitHub natively supports Jupyter, so you can host notebooks for free there. Example: <https://github.com/mdbloice/Augmentor/blob/master/notebooks/Augmentor_Keras.ipynb>. GitHub is often used to host Jupyter notebooks for a publication, as we saw above. Note that GitHub hosts notebooks **statically**, meaning they cannot be executed. You can view them, and download them, but not run them. 

Back to the structure of a notebook. Notebooks can contain text, images, and also code. When you execute Python code, the code is sent to a server (which can be running on your local machine or elsewhere), and the results or output of the code appear in the notebook.

There are therefore different types of cells: text cells and code cells.

This is a text cell. Below we have a code cell:

In [1]:
print("A code cell is executed and the results are printed to the notebook!")

A code cell is executed and the results are printed to the notebook!


The result of a cell is not necessarily only text or numbers; it can also be an image or a plot, of course.

## Jupyter Basic Usage

I will demonstrate the following now:

- Create new cells
- Making a cell active 
- Executing a cell
- Changing the cell type
- Placing a cell above and below the current position
- Hiding the output of a cell
- Moving a cell up or down

Also, the IDE itself allows you to use tabs, open terminals, view data, and so on. It is a proper IDE, from which you can do most of your work!

- Create a new notebook
- Browse notebooks
- Open a terminal
- Open a dataset

In the menu there are a few options to look at:

- **Run** menu
    - Run a cell
    - Run all cells
- **File** menu
    - Download
    - Convert
- **Kernel** menu
    - What is a kernel?
    - Interrupt kernel
    - Restart kernel

## Jupyter Tips and Tricks

Ok, so we have seen the basic usage; now let's discuss some more advanced features you may not know about, even if you use Jupyter a lot.

### Magics 

You can execute magics using `%` and `%%`. 

Line magics begin with a single `%` and they apply only to that one line.

Cell magics begin with double `%%` and apply to the entire cell.

#### Line Magics

For example, to see a history of your previous commands, use `%history`:

In [2]:
%history -l 10

import torchinfo

torchinfo.summary(model)
torchinfo.summary(model)
from tqdm.notebook import tqdm

for epoch in range(number_of_epochs):
    train_correct = 0
    train_total = 0
    test_correct = 0
    test_total = 0
    
    model.train()
    for inputs, targets in tqdm(train_loader):
        # forward + backward + optimize
        optimizer.zero_grad()
        outputs = model(inputs)
        
        #if task == 'multi-label, binary-class':
        #    targets = targets.to(torch.float32)
        #    loss = criterion(outputs, targets)
        #else:
        targets = targets.squeeze().long()
        loss = criterion(outputs, targets)
        
        loss.backward()
        optimizer.step()
from sklearn.metrics import classification_report

y_pred_cr = []
y_true_cr = []

def test(split):
    model.eval()
    y_true = torch.tensor([])
    y_score = torch.tensor([])
    
    data_loader = train_loader if split == 'train' else test_loader

    with torch.no_grad():
        for input

The `-l 10` argument prints only the last 10 commands.

Another useful one is the `%whos` command, which lists all currently stored variables and some information about each one.

In [3]:
%whos

Interactive namespace is empty.


If you only want variables of a particular type, such as lists, pass the type name:

In [4]:
%whos list

No variables match your requested type.


#### Cell Magics

These operate on the entire cell. 

A very useful one is the `%%timeit` cell magic:

In [5]:
%%timeit

a = [1,2,4]
b = 6

c = a * b

48.7 ns ± 0.0404 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
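
The `%%timeit` magic only exists inside Jupyter/IPython. As a rough sketch of how a similar measurement could be made in a plain Python script, the standard library `timeit` module can be used. The loop count of 100,000 below is an arbitrary choice for illustration, and the timings you see will differ from the output above.

```python
import timeit

# Time the same statement as in the cell above, outside of Jupyter.
# The setup string runs once; the statement runs `number` times.
loops = 100_000  # arbitrary choice for this sketch
elapsed = timeit.timeit("c = a * b", setup="a = [1, 2, 4]; b = 6", number=loops)

# Convert the total time in seconds to nanoseconds per loop.
print(f"{elapsed / loops * 1e9:.1f} ns per loop")
```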


Or, write the entire contents of the cell to a file on the disk:

In [6]:
%%writefile atsp.txt

hello, world

Overwriting atsp.txt


Use `-a` to append to a file:

In [7]:
%%writefile -a atsp.txt

welcome to the course

Appending to atsp.txt


This is very useful for writing logs!

To see a list of all magics, use `%magic`. Be warned, this produces a lot of output! However, we can use this to demonstrate how to hide the output of a cell.

In [8]:
%magic


IPython's 'magic' functions

The magic function system provides a series of functions which allow you to
control the behavior of IPython itself, plus a lot of system-type
features. There are two kinds of magics, line-oriented and cell-oriented.

Line magics are prefixed with the % character and work much like OS
command-line calls: they get as an argument the rest of the line, where
arguments are passed without parentheses or quotes.  For example, this will
time the given statement::

        %timeit range(1000)

Cell magics are prefixed with a double %%, and they are functions that get as
an argument not only the rest of the line, but also the lines below it in a
separate argument.  These magics are called with two arguments: the rest of the
call line and the body of the cell, consisting of the lines below the first.
For example::

        %%timeit x = numpy.random.randn((100, 100))
        numpy.linalg.svd(x)

will time the execution of the numpy svd routine, running the assignment 

### Execute Commands
You can execute commands from the command line using the exclamation mark at the beginning of a line:

In [9]:
! ls

 Assignment.ipynb	    dog.jpg
 atsp.txt		    img
 Augmentation.ipynb	   'Lin Reg Examples.ipynb'
'Balancing Dataset.ipynb'   mathtools.py
 CIFAR10.ipynb		    model.pickle
 classifier.pickle	   'Normalisation and Scaling.ipynb'
 data			    old-notebooks
'Day 1 - Afternoon.ipynb'   __pycache__
'Day 1 - Morning.ipynb'    'Removed Sections.ipynb'
'Day 2 - Afternoon.ipynb'  'Scientific Notation.ipynb'
'Day 2 - Morning.ipynb'     XGBoost.ipynb


In [10]:
! df -h

Filesystem      Size  Used Avail Use% Mounted on
tmpfs            13G  4.6M   13G   1% /run
/dev/nvme0n1p2  1.8T  482G  1.3T  28% /
tmpfs            63G  1.2G   62G   2% /dev/shm
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
/dev/nvme0n1p1  511M  6.1M  505M   2% /boot/efi
/dev/sda        7.3T  575G  6.3T   9% /home/mblo/Samsung8TB
tmpfs            13G  340K   13G   1% /run/user/1000
tmpfs            63G     0   63G   0% /run/qemu


### Documentation
Use `?` to access the documentation of any function or class.

For example, we can view the documentation of the built-in Python function `open()`, which is used to open files on the disk.

In [11]:
open?

Signature:
open(
    file,
    mode='r',
    buffering=-1,
    encoding=None,
    errors=None,
    newline=None,
    closefd=True,
    opener=None,
)
Docstring:
Open file and return a stream.  Raise OSError upon failure.

file is either a text or byte string giving the name (and the path
if the file isn't in the current working directory) of the file to
be opened or an integer file descriptor of the file

**Note**: you do not write the parentheses:

In [12]:
open()?

SyntaxError: invalid syntax (3190824969.py, line 1)

### Autocompletion / Intellisense

Use `Tab` to autocomplete your code, for example to show a list of available functions:

In [13]:
a = "some text"

In [14]:
a.capitalize()

'Some text'

Use `Shift + Tab` to bring up the documentation and the parameters for that function:

In [15]:
a.capitalize()

'Some text'

## Keyboard Shortcuts

A full list of shortcuts can be found in the help menu, under Help -> Show Keyboard Shortcuts.

Some I use a lot:

- `Esc`: exits edit mode.
- `Enter`: enter edit mode.
- `A`: create a new cell above
- `B`: create a new cell below
- `D` + `D`: delete current cell (hit `D` twice in quick succession)
- `Y`: Change cell to code cell
- `M`: Change cell to text/markdown cell
- `Shift` + `L`: show line numbers. This is useful for error messages. 

### Markdown

As mentioned, text cells in Jupyter can be formatted using Markdown. 

Here is the most important Markdown formatting syntax:

Headings:
```
# Header 1
## Header 2
### Header 3
```

Text formatting:

```
Some **bold** text
Some *italics* text 
```

Lists:
```
- a list
- of unordered
- items

1. a list
2. of ordered
3. items

- [x] a list
- [ ] of todo
- [ ] items
```

Code:
```
A `code` sample
```
Blocks of code use 3 backticks.

Horizontal Rule:

```
---
```

Links:
```
[title](https://www.example.com)
```
or 
```
<https://example.com>
```

Images:
```
![Text Description](image.jpg)
```

See the following guide for a comprehensive list: <https://www.markdownguide.org/cheat-sheet/>

### Interactivity

We will not cover this, however it is possible to run interactive widgets in Jupyter notebooks.

See <https://ipywidgets.readthedocs.io/en/stable/> for more details. 

Using interactive elements you can have sliders for example, where variables can be adjusted visually and so on.

---

# Basics of Python

A good guide: <https://swcarpentry.github.io/python-novice-gapminder/04-built-in.html>

- Variables
- Types
- Comments
- Built-in Functions
- Writing Functions
- Classes
- Modules

## What is Python

Python is a general purpose programming language. This differs somewhat from R which is much more focussed on statistics and data analysis. For example in Python you can write GUI applications, web applications, server-side applications, as well as use it for systems administration, machine learning, or statistics. It is a very broad language.

Therefore, Python has become probably the most popular language in the world, with the exception of perhaps JavaScript, which is used for web development (and is not as general purpose as Python).

We will see from the next section that it is a very simple programming language, with minimal syntax.

## Variables

You declare variables in much the same way as in R:

In [16]:
a = 2

Python is **dynamically typed**. This means that you do not need to declare which **types** you are using: for example, you do not need to declare that the variable above is an integer.

In other languages you would need to say:

```java
// Java code
int age = 21;
String name = "Marcus";
```

What you are doing here is telling Java that `age` will contain an integer and `name` will contain a string.

In programming a **type** is the kind of variable. All programming languages will have types such as integers (whole numbers, -3, -2, -1, 0, 1, 2, 3), floats (real numbers, 3.14 or 0.00005), strings (text, "example"), collections such as **lists**, **dictionaries**, and **tuples**,  and others.

Being dynamically typed doesn't mean Python doesn't have types at all, it just means that they are automatically assigned based on the data. You can find a variable's type at any time using the `type()` function:

In [17]:
type(a)

int

In [18]:
type("Example")

str

Therefore, if you try to add two variables of different types, Python will try to work out what you are doing:

In [19]:
a = 2
b = 4.5

print(a + b)

6.5


In [20]:
type(a)

int

In [21]:
type(b)

float

Some languages will throw an error for trying this.

This automatic type conversion can only go so far, however, as the following demonstrates:

In [22]:
a = 2
b = "4.5"
print(2 + b)

TypeError: unsupported operand type(s) for +: 'int' and 'str'

The solution to this is to **cast** your variable:

In [23]:
a = 2
b = "4.5"

print(2 + float(b))

6.5
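
As a brief sketch of casting in both directions, the built-in functions `float()`, `int()`, and `str()` convert between the basic types. The variable names below are only illustrative. Note that a string which does not look like a number cannot be cast and raises a `ValueError`.

```python
# Casting between basic types (variable names are illustrative).
price_text = "4.5"
price = float(price_text)      # str -> float
count = int("3")               # str -> int
label = str(price * count)     # float -> str

print(price, count, label)     # prints: 4.5 3 13.5

# A string that is not a valid number cannot be cast to a float.
try:
    float("four point five")
    cast_ok = True
except ValueError:
    cast_ok = False

print(cast_ok)                 # prints: False
```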


## Collections

There are 4 collection types in Python, and you will come across all of them in Machine Learning tasks.

You can think of collections as lists of items, and each of the 4 collection types have their own distinct properties and characteristics.

The 4 collection types are:

- Lists
- Dictionaries
- Sets
- Tuples

### Lists

Lists are, unsurprisingly, lists of objects. They can be numerical or text, or mixed.

- Lists are **mutable**, that means they can be changed after they have been declared. 
- Lists are **ordered**, meaning they maintain the order in which you added the elements, unless you explicitly change it.
- You create lists by placing items inside square brackets `[]`, separated by commas.
- Lists **allow duplicates**: they can contain multiple occurrences of the same element.

Use lists when you have a collection of items where order matters and you might need to change the items (add, remove, or change).
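
The list properties above can be sketched in a few lines. The values are made up for illustration.

```python
# A list is ordered, mutable, and may contain duplicates.
scores = [0.9, 0.7, 0.9]

scores.append(0.5)     # add an item to the end
scores[1] = 0.8        # change an item in place
scores.remove(0.9)     # remove the first occurrence of 0.9

print(scores)          # prints: [0.8, 0.9, 0.5]
print(scores[0])       # items are accessed by position; prints: 0.8
```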

### Sets

Sets are like mathematical sets. Here are some of the properties of sets:

- **Mutable**: You can add or remove elements from a set
- **Unordered**: Sets do not record element position or order of insertion. You might ask why? This can make them very fast, and they are useful for holding many millions of elements.
- You create sets by placing items inside curly brackets `{}`, or using the `set()` function. Note that an empty `{}` creates a dictionary, so use `set()` for an empty set.
- **Duplicates are not allowed**: A set cannot contain multiple occurrences of the same element.

Usage: Use sets when you need to ensure that an element only appears once in a collection. They are often used for membership testing and for removing duplicates (if you convert a list into a set, you get all unique elements of the list in one quick statement). They are also used for mathematical set operations, such as intersection, union, and difference.
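
A minimal sketch of these uses, with made-up labels. Sets are unordered, so `sorted()` is used here only to make the printed output predictable.

```python
labels = ["cat", "dog", "cat", "bird", "dog"]

# Converting a list into a set drops the duplicates.
unique_labels = set(labels)
print(sorted(unique_labels))      # prints: ['bird', 'cat', 'dog']

# Membership testing.
print("cat" in unique_labels)     # prints: True

# Mathematical set operations.
train = {"cat", "dog", "bird"}
test = {"dog", "fish"}

common = train & test             # intersection
combined = train | test           # union
only_train = train - test         # difference

print(sorted(common))             # prints: ['dog']
print(sorted(combined))           # prints: ['bird', 'cat', 'dog', 'fish']
print(sorted(only_train))         # prints: ['bird', 'cat']
```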


### Dictionaries

Dictionaries are different in that they are arranged in key-value pairs. We will see dictionaries often later.

- **Mutable**: You can change, add, or delete key-value pairs after the dictionary is created.
- Ordered, since Python 3.7. Generally this does not matter, as dictionaries are normally accessed via their keys.
- Syntax: Created with curly brackets `{}` containing key-value pairs separated by colons, in the form `{'key': 'value'}`.
- **Unique Keys**: The keys have to be unique, but values can be duplicated.

Usage: Use dictionaries when you need to associate a unique key with a value.

We will see dictionaries used a lot during the course, but here is an example, so that it is clear what we mean by key-value pair:

```python
d = {'name': 'marcus', 'age': 21}
```

You can access a particular value using its key, as follows:

```python
d['name']
# returns 'marcus'
```
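
Because dictionaries are mutable, values can be changed, added, and deleted after creation. The following is a small sketch with hypothetical keys (`course` and `email` are invented for illustration). It also shows `.get()`, which returns a default value instead of raising a `KeyError` when a key is missing.

```python
person = {'name': 'marcus', 'age': 21}

person['age'] = 22            # change an existing value
person['course'] = 'ATSP'     # add a new key-value pair (hypothetical key)
del person['name']            # delete a key-value pair

print(person)                 # prints: {'age': 22, 'course': 'ATSP'}

# .get() returns a default instead of raising KeyError for a missing key.
missing = person.get('email', 'unknown')
print(missing)                # prints: unknown
```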

### Tuples

Tuples are immutable (unchangeable), ordered, and can contain duplicate items.

- **Immutable**: Once a tuple is created, you cannot change it, cannot add elements to it, or remove elements from it
- Ordered: The items have a defined order
- Syntax: Created by placing items inside parentheses `()`, separated by commas
- Duplicates are allowed: Tuples can contain multiple occurrences of the same element

Usage: Use tuples when you have an ordered collection of items that should never change! 
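
As a short sketch of these properties, the tuple below holds a hypothetical image shape (height, width, channels). Trying to change an element raises a `TypeError`.

```python
# A hypothetical image shape: height, width, channels.
shape = (224, 224, 3)

# Tuples are ordered, so they can be unpacked by position.
height, width, channels = shape
print(height, width, channels)   # prints: 224 224 3

# Tuples are immutable: assigning to an element raises TypeError.
try:
    shape[0] = 256
    changed = True
except TypeError:
    changed = False

print(changed)                   # prints: False
```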

## Control Flow

So you should of course all be aware of if, else, elif:

```python
if a < b:
    print('Computer says no')
else:
    print('Computer says yes')
```

and 

```python
if a == '+':
    op = PLUS
elif a == '-':
    op = MINUS
elif a == '*':
    op = TIMES
else:
    op = UNKNOWN
```

## Loops

The `for` loop is the most common type of loop in Python. It simply iterates over all elements of a collection, such as a list.

We will see this many times during the seminar.

If you are looping over a list, a very useful function to know is `enumerate()`. This loops over the entire list, but also returns the index of each item. I will cover this later.

There is also the `while` loop, which we do not come across during this seminar.
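
For reference, here is a minimal sketch of both loop types, using made-up values. The `for` loop visits each element of a list, while the `while` loop repeats for as long as its condition holds.

```python
# for loop: visit every element of a list.
total = 0
for value in [1, 2, 3, 4]:
    total += value

print(total)    # prints: 10

# while loop: keep halving x until it drops below 1.
x = 20
steps = 0
while x >= 1:
    x = x / 2
    steps += 1

print(steps)    # prints: 5
```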

## Functions

Functions are very similar to those in R or other languages. 

You define a new function using `def`:

```python
def my_function(name):
    print(f"Hello {name}")
```

and call a function as follows:

```python
my_function('marcus')
# prints: Hello marcus
```
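
Functions often return a value rather than printing it, and parameters can have default values. The following sketch uses a hypothetical function `greeting()` with an invented `punctuation` parameter to show both.

```python
def greeting(name, punctuation="!"):
    """Return a greeting string instead of printing it."""
    return f"Hello {name}{punctuation}"

# The default value is used when the argument is omitted.
message = greeting("marcus")
print(message)     # prints: Hello marcus!

# Arguments can also be passed by name.
question = greeting("marcus", punctuation="?")
print(question)    # prints: Hello marcus?
```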

## Packages

Python packages are available via the *Python Package Index*, known simply as *PyPI*, and packages can be installed using `pip`, which installs packages from the PyPI repository.

We can take a look at the PyPI now: <https://pypi.org>

You install a package using the `pip install` command, so for example:

```
$ pip install numpy
```

However, before you install anything, you should be aware of virtual environments.

## Virtual Environments

It is highly recommended that you create virtual environments for each of your projects. 

So what is a virtual environment? Basically it is an isolated Python environment. 

When you run the `python` command from the command line, you are running the system-wide Python installation. It is recommended that you avoid installing many packages into the system Python.

Instead, you should use virtual environments. Basically, for each project that you are working on, you should use a separate virtual environment.

Why do this? 

Some projects will require many different packages, and some packages depend on certain versions of other packages. If, for example, one project requires version 1.5 of NumPy, but another project requires at most version 1.4, then you will have a dependency conflict, and these can be difficult to solve. Therefore, it is much easier to manage a smaller virtual environment for each project.

Create a virtual environment using Python's built-in `venv` tool. You want to do this in the root directory of your current project. 

```bash
$ python -m venv my-new-virtual-env
```

This creates a new virtual environment called `my-new-virtual-env` in the current directory, under the directory `./my-new-virtual-env`.

Within this `./my-new-virtual-env` there is a new, fresh Python environment. All you need to do now is declare this virtual environment as active:

```bash
$ source ./my-new-virtual-env/bin/activate
```

or in Windows:

```cmd
my-new-virtual-env\Scripts\activate
```

and your virtual environment will now be active. Anything you do within this environment will not affect the system version of Python, nor will it affect any other virtual environment.

You will know you are in a virtual environment, as your command line shell will change its appearance, and look something like this:

```bash
(my-new-virtual-env) $
```

where the name of the current virtual environment appears before the prompt.
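
You can also check from within Python whether a virtual environment is active. As a small sketch: in an environment created with `venv`, `sys.prefix` points at the environment directory, while `sys.base_prefix` points at the Python installation it was created from, so the two differ. The variable name `in_virtual_env` is only illustrative.

```python
import sys

# In a venv, sys.prefix is the environment directory and
# sys.base_prefix is the underlying Python installation.
in_virtual_env = sys.prefix != sys.base_prefix

print(f"Interpreter: {sys.executable}")
print(f"Running inside a virtual environment: {in_virtual_env}")
```

This can be a handy check at the top of a notebook, to confirm that packages will be installed where you expect.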

## Python Tips

Here are a few tips that you should be aware of that I have found useful after many years of Python programming :) 

### Python Interactive Mode

You know you can run Python scripts from the command line using: 

```bash
$ python my_script.py
hello world
$
```

However, it can be difficult to debug these scripts, as if you encounter an error, the script will terminate and you will be dumped back to the command line. 

What is much more useful is to use `-i` to run a script in interactive mode: 

```bash
$ python -i my_script.py
>>>
```

This runs the script, but keeps you within an interactive session, so that you can inspect variables and so on.

### Looping Over A Collection

Imagine you have a list of colours: `["red", "blue", "green", "orange"]` and you wanted to loop over it, you might be inclined to write something like:

In [24]:
colours = ["red", "blue", "green", "orange"]

for i in range(len(colours)):
    print(colours[i])

red
blue
green
orange


If you came from C or some other language, you might do this intuitively. 

In Python this is much easier:

In [25]:
for colour in colours:
    print(colour)

red
blue
green
orange


### Removing items from a list

One thing that can often catch you out is removing items from a list while you are iterating over it:

```python
for colour in colours:
    if colour == "red":
        colours.remove(colour)
```

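This can silently skip items, because removing an element shifts the remaining ones while the loop is still running. Never add or remove items from a list you are iterating over! Instead, iterate over a copy using `copy()`: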

In [26]:
for colour in colours.copy():
    if colour == "red":
        colours.remove(colour)

In [27]:
colours

['blue', 'green', 'orange']
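
To illustrate why removing items during iteration is risky, here is a small sketch with made-up numbers. When the first `2` is removed, the second `2` shifts into its position and the loop steps past it. A list comprehension, which builds a new list instead of modifying the original, is a common alternative to `copy()`.

```python
numbers = [1, 2, 2, 3]

# Unsafe: removing from the list being iterated shifts later items
# forward, so the second 2 is skipped and survives.
unsafe = numbers.copy()
for n in unsafe:
    if n == 2:
        unsafe.remove(n)

print(unsafe)    # prints: [1, 2, 3]

# Safe alternative: build a new list with only the items to keep.
safe = [n for n in numbers if n != 2]

print(safe)      # prints: [1, 3]
```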

### The `reversed()` function

How would you iterate over a list backwards?

You might say something like:

In [28]:
for i in range(len(colours)-1, -1, -1):
    print(colours[i])

orange
green
blue


This works, but it is much easier to use the built-in `reversed()` function:

In [29]:
for colour in reversed(colours):
    print(colour)

orange
green
blue


### Looping with indices

Notice above we looped over the items and didn't have their indices. What if you needed their indices? You could say:

In [30]:
colours = ['red', 'green', 'yellow', 'orange', 'purple']

for i in range(len(colours)):
    print(i, " --> ", colours[i])

0  -->  red
1  -->  green
2  -->  yellow
3  -->  orange
4  -->  purple


This is the C way. In Python just use `enumerate`:

In [31]:
for i, colour in enumerate(colours):
    print(i, " --> ", colour)

0  -->  red
1  -->  green
2  -->  yellow
3  -->  orange
4  -->  purple


### The `zip()` function

Imagine you had two lists:

In [32]:
names = ["Matilda", "Theo", "Liam"]
colours = ['red', 'green', 'yellow', 'orange', 'purple']

And you want to loop over both lists, pairwise, one at a time. How would you do this?

You could say:

In [33]:
n = min(len(names), len(colours))

for i in range(n):
    print(names[i], " --> ", colours[i])

Matilda  -->  red
Theo  -->  green
Liam  -->  yellow


This works, but in Python there is of course a better way, by using the built-in `zip()` function:

In [34]:
for name, colour in zip(names, colours):
    print(name, " --> ", colour)

Matilda  -->  red
Theo  -->  green
Liam  -->  yellow


If the lists are different lengths, it stops at the end of the shortest list:

In [35]:
list(zip(['a','b','c'], [1,2]))

[('a', 1), ('b', 2)]
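
One common use of `zip()` is building a dictionary from two lists, sketched here with the same example data as above. As before, the extra colours are dropped because `zip()` stops at the shortest list.

```python
names = ["Matilda", "Theo", "Liam"]
colours = ['red', 'green', 'yellow', 'orange', 'purple']

# Pair each name with a colour and turn the pairs into a dictionary.
favourite = dict(zip(names, colours))

print(favourite)
# prints: {'Matilda': 'red', 'Theo': 'green', 'Liam': 'yellow'}
```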

### Looping in sorted order

You can use the `sorted()` function to loop in sorted order: 

In [36]:
for colour in sorted(colours):
    print(colour)

green
orange
purple
red
yellow


What if you wanted reverse order? You can pass the argument `reverse=True`:

In [37]:
for colour in sorted(colours, reverse=True):
    print(colour)

yellow
red
purple
orange
green


### Removing Items from a Dictionary

It is very easy to loop over a dictionary in Python. Imagine you had the following dictionary:

In [38]:
d = {'matilda': 'red', 'oscar': 'green', 'liam': 'yellow'}

In [39]:
for k in d:
    print(k)

matilda
oscar
liam


Notice that `k` contains the keys, as this is often what you want to loop over, and you can use each key to access its value within the loop.

If you want to delete items as you loop, use `list()` to get a copy of the dictionary's keys, and then use the `del` statement:

In [40]:
for k in list(d.keys()):
    if k.startswith('l'):
        del d[k]

In [41]:
d

{'matilda': 'red', 'oscar': 'green'}

Again, you don't want to be deleting items from anything you are looping over.
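
If you need both the key and the value inside the loop, the `.items()` method returns them as pairs. As a sketch, the same filtering can also be done without deleting anything, by building a new dictionary with a dictionary comprehension. The name `favourites` is only illustrative.

```python
favourites = {'matilda': 'red', 'oscar': 'green', 'liam': 'yellow'}

# Loop over keys and values together.
for name, colour in favourites.items():
    print(f"{name} --> {colour}")

# Build a new dictionary without the keys starting with 'l',
# leaving the original untouched.
filtered = {k: v for k, v in favourites.items() if not k.startswith('l')}

print(filtered)    # prints: {'matilda': 'red', 'oscar': 'green'}
```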

### The `range()` function

The `range()` function is very useful for creating arrays of dummy data, etc.

For example, for the numbers 0 to 9:

In [42]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


When passing only one value n, you get the values 0 to n, not including n. 

The function actually takes the form `range([start,] end [,step])` so that you can say the following also, for example:

In [43]:
for j in range(10,20):
    print(j)

10
11
12
13
14
15
16
17
18
19


Here you are passing the start and end values. The start value is included, but the end value is not.

You can also specify a step, for example:

In [44]:
for k in range(10,20,2):
    print(k)

10
12
14
16
18
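
Two further details are worth a quick sketch: `range()` produces its values lazily, so you wrap it in `list()` to see them all at once, and the step can be negative to count down:

```python
# range() is lazy; wrap it in list() to see all values at once
evens = list(range(0, 10, 2))
print(evens)

# A negative step counts down; the end value is still excluded
countdown = list(range(5, 0, -1))
print(countdown)
```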


## The `any()` and `all()` functions

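The `any()` and `all()` functions are often overlooked, even by experienced Python programmers. `any()` returns `True` if at least one element is true, and `all()` returns `True` only if every element is true: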

In [45]:
a = [False, True, True, True]
any(a)

True

In [46]:
all(a)

False

This works not only for Booleans, but also for any values that Python evaluates as true or false (for example, `0` counts as false and any non-zero number counts as true):

In [47]:
a = [0, 0, 0, 0, 1]
any(a)

True

In [48]:
all(a)

False
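
In practice, `any()` and `all()` are often combined with a generator expression that tests a condition on each element. A minimal sketch, using made-up probabilities:

```python
probabilities = [0.91, 0.15, 0.78, 0.66]

# Is at least one probability above 0.9?
has_confident = any(p > 0.9 for p in probabilities)

# Are all values valid probabilities, i.e. between 0 and 1?
all_valid = all(0.0 <= p <= 1.0 for p in probabilities)

print(has_confident, all_valid)

# Edge case: on an empty list, all() is True and any() is False
print(all([]), any([]))
```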

## Printing

There are several ways to print using Python, which is slightly annoying as you will see various examples of each whenever you are looking for code. 

The cleanest and most modern way to print is using f-strings (formatted strings), and we will use those exclusively in these seminars.

You define an f-string using `f` before the string's inverted commas:

```python
f"This is a formatted string."
```

You will see this most often within `print()` statements.

With f-strings, you can place variables within your strings using curly brackets:

In [49]:
greeting = 'ATSP 2024 Students'

print(f"Welcome, {greeting}!")

Welcome, ATSP 2024 Students!


Code within the brackets is actually evaluated:

In [50]:
x = 3
y = 6
print(f"x: {x}, y: {y}, sum: {x + y}")

x: 3, y: 6, sum: 9


One thing you will do a lot is format very long floats in to something more readable. 

Most algorithms return probabilities that are very long, such as 0.78667479273278, which has far too much precision for human reading:

In [51]:
probability = 0.78667479273278472914

print(f"Class probability: {probability}")

Class probability: 0.7866747927327847


Using a `:` you can define how a variable is printed:

In [52]:
print(f"Class probability: {probability:.3f}")

Class probability: 0.787


Here `.3` specifies three digits after the decimal point, and `f` specifies fixed-point notation, so the value is rounded for display. Use `.2f` for two decimal places, and so on.

Large numbers can also be printed using thousands separators:

In [53]:
large_n = 7630726492344.8934

print(f"Samples: {large_n:,.2f}")

Samples: 7,630,726,492,344.89


Often you will work with percentages:

In [54]:
percent = 0.256
print(f"Percentage: {percent:.2%}")

Percentage: 25.60%


Alignment can be done by specifying a minimum field width before the precision; shorter values are padded with spaces, and numbers are right-aligned by default:

In [55]:
class_ids = [423, 23, 9, 810, 10]

for class_id in class_ids:
    print(f"Class ID: {class_id:3.0f}")

Class ID: 423
Class ID:  23
Class ID:   9
Class ID: 810
Class ID:  10
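
You can also control the direction of the alignment explicitly with `<` (left), `>` (right), and `^` (centre). The following sketch builds a small two-column table; the labels and scores are invented for the example:

```python
rows = [("setosa", 0.913), ("versicolor", 0.5), ("virginica", 0.78)]

lines = []
for label, score in rows:
    # <12 pads the label to 12 characters, left-aligned
    # >8.2f right-aligns the score in 8 characters with 2 decimal places
    lines.append(f"{label:<12}|{score:>8.2f}|")

print("\n".join(lines))
```

Because every field has a fixed width, the `|` separators line up regardless of the length of each label.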


Just to give you an example of the older way strings can be formatted in Python, which you will still see in existing code:

In [56]:
temp = 13
day = 'Thursday'

# Old way
print("On %s it will be %s°C" % (day, temp))

# New way
print(f"On {day} it will be {temp}°C")

On Thursday it will be 13°C
On Thursday it will be 13°C


The old way gets very difficult if you have 10 variables you want to print. Then you have to literally count the occurrences of `%s` to find the variable you want to change, and also ensure that they are in the right order in the tuple at the end of the string.

### String's `.center()` function

This is very useful for nicely printing results to the console, or for printing logs and so on during debugging:

In [57]:
accuracy = 0.913
f1_score = 0.876

print(f" Summary ".center(40, '-'))

print(f"Accuracy: {accuracy}")
print(f"f1 Score: {f1_score}")

print(f" End Summary ".center(40, '-'))

--------------- Summary ----------------
Accuracy: 0.913
f1 Score: 0.876
------------- End Summary --------------


## Scientific Notation

You will see scientific notation a lot when using Python for Machine Learning, especially when you are looking at the outputs of models that return probabilities and so on.

To print a number in scientific notation, use `:e`, as follows:

In [58]:
num = 0.100009732

print(f"{num:e}")

1.000097e-01


If you have a number written in scientific notation and want to display it in ordinary fixed-point notation:

In [59]:
num = 8.2885e-01

print(f"{num:.8f}")

0.82885000
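
As with `f`, you can put a precision in front of `e`. Python also understands scientific notation when converting strings to floats. A short sketch (the variable `learning_rate` is just an illustrative name):

```python
learning_rate = 0.00012345

# .2e keeps two digits after the decimal point of the mantissa
formatted = f"{learning_rate:.2e}"
print(formatted)

# Scientific notation is also understood when parsing strings
parsed = float("1e-3")
print(parsed == 0.001)
```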


## Organising Your Code

In Python, a Python file is a module, meaning it can be imported in to the current namespace using `import`. 

For example, we have the following file, `mathtools.py` in our current directory:

```python
# mathtools.py

def square(n):
    return n*n
```

You import this in to a Python session using `import`:

In [60]:
import mathtools

In [61]:
mathtools.square(2)

4

In [62]:
from mathtools import square

In [63]:
square(5)

25

This can be very useful if you have a bunch of frequently used functions you have collected over the years and just wish to use them within a notebook. 

### Using `-i` for Scripting

Most of the time, if you are using Python for data science, you will use Jupyter notebooks to at least prototype your code, and later perhaps consolidate your code into scripts you can run from the command line. For example, very long-running scripts should be run from the command line so that they are not interrupted if you accidentally close a browser window. If the script prints a lot of output, it also makes much more sense to run it in a console rather than within a Jupyter notebook.

To run a Python script from the command line, we just use the `python` command followed by the file name:

```bash
$ python mathtools.py
```

We can demonstrate this in the console now...

One drawback of this is that you are returned to the shell as soon as the script ends. Sometimes it is much better to stay in the Python interpreter and examine some of the variables or data that were created during the script.

To do this, you can run scripts using `-i`:

```bash
$ python -i mathtools.py
```

Let's demo this now in the console also.

By using `dir()` in the interactive session, we can see which names have been defined, and we can take a look at them.

This is especially useful if an error occurs. Even after an error occurs, you are not returned to the shell. You can inspect the variables and perhaps find the cause of your error.

Use `exit()` to finally leave the interpreter and return to the shell.
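
Because the same file can be both imported as a module and run as a script, a common pattern is to guard the script-only code with `if __name__ == "__main__":`. The sketch below is a hypothetical extended version of `mathtools.py`; the `main()` function is an assumption added for illustration and is not part of the file shown earlier:

```python
# mathtools.py (hypothetical extended version)

def square(n):
    """Return n multiplied by itself."""
    return n * n


def main():
    # Runs only when the file is executed directly, e.g. `python mathtools.py`
    for n in range(4):
        print(f"{n} squared is {square(n)}")


if __name__ == "__main__":
    main()
```

With this layout, `import mathtools` defines `square()` without printing anything, while `python mathtools.py` (or `python -i mathtools.py`) runs `main()`.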

---

# NumPy

What is NumPy? You can think of it as an extension for Python which adds support for large arrays and matrices. It also provides a large number of functions to work with these arrays, which we will see shortly. If you know MATLAB, then NumPy will seem very familiar to you. As opposed to Pandas, which deals with Excel-style tabular data, NumPy is designed primarily for numerical data.

## Arrays and Matrices

So what are arrays and matrices?

An array is just a list of numbers. You might also know it as a vector.

For example:

```python
a = [1, 2, 3, 5, ..., n]
```

Arrays can be 2D or multi-dimensional. The dimensions are called **axes** in NumPy parlance.

So a standard array is 1-dimensional, and is basically a list like the list we saw above.

A 2D array is like a table (you can think of it as a list of 1D arrays), and is synonymous with a matrix in mathematics.

A 3D array has a depth dimension.  You can kind of think of it as a list of 2D arrays, and can be considered a tensor in mathematics.

![numpy](./img/numpy-matrices-crop.png)

*Source*: <https://jovian.com/anujadp4/python-numerical-computing-with-numpy>

NumPy is n-dimensional, so you can go to even further dimensions, for example a 4D array is a list of 3D arrays, and so on. After about 4 dimensions it becomes a bit difficult to parse in terms of the syntax, which we will see momentarily.

A lot of the data in your machine learning tasks is going to be either supplied to you as NumPy data, or you will manipulate this data using NumPy, or the software or algorithms will require input as NumPy data. Therefore, it is an essential skill.

Let's start with importing Numpy:

In [64]:
# It is convention to import Numpy as np
import numpy as np
from pprint import pprint

Let's create a 1D array:

In [65]:
a = np.array([1,2,3,4,5,6,8,9,10])

In [66]:
a

array([ 1,  2,  3,  4,  5,  6,  8,  9, 10])

Let's look at its type:

In [67]:
type(a)

numpy.ndarray

You can see it is a NumPy *n*-dimensional array (`ndarray`). In our case, it is a 1D array.

You can verify this using the `ndim` property:

In [68]:
a.ndim

1
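
Alongside `ndim`, a few other properties are worth knowing. The sketch below uses a small made-up 2D array `m` to show them:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.ndim)   # number of axes
print(m.shape)  # length of each axis: (rows, columns)
print(m.size)   # total number of elements
print(m.dtype)  # element type; the exact integer type depends on the platform
```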

Arrays are indexed using a zero-based index, just like normal Python lists. However, NumPy indexing is much more capable, as we will see later.

Simple index:

In [69]:
a[0]

1

The last element can be accessed using `-1`:

In [70]:
a[-1]

10

Or the 3rd from last element:

In [71]:
a[-3]

8

So far we have seen nothing that you cannot do with normal lists.

However, if you inspect the list of available functions, you will soon see that NumPy arrays are much more powerful than standard Python lists.

We can see a list of the available functions using tab completion (type `a.` and press Tab). One of them is `all()`:

In [73]:
a.all()

True

Or indeed, we can view the documentation as we have done previously. Documentation is available not just for Python's built-in functions, but also for functions from packages such as NumPy:

In [74]:
np.array?

Docstring:
array(object, dtype=None, *, copy=True, order='K', subok=False, ndmin=0,
      like=None)

Create an array.

Parameters
----------
object : array_like
    An array, any object exposing the array interface, an object whose
    ``__array__`` method returns an array, or any (nested) sequence.
    If object is a scalar, a 0-dimensional array containing object is
    returned.
dtype : data-type, optional
    The desired data-type for the array. If not given, NumPy will try to use
    a default ``dtype`` that can represent the values (by applying promotion
    rules when necessary.)
copy : bool, optional
    If true (default), then the object is copied.  Otherwise, a copy will
    only be made if ``__array__`` returns a copy, if obj is a nested
    sequence, or if a copy is needed to satisfy any of the other
    requirements (``dtype``, ``order``, etc.).
order : {'K', 'A', 'C', 'F'}, optional
    Specify the memory layout of the array. If object is not an array, the
   

## Array Broadcasting

Array broadcasting allows you to perform operations over arrays without needing to loop over them.

A simple example illustrates this. We declare a few standard Python lists as follows:

In [75]:
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

In [76]:
b = [10, 20, 30]

Now let's say I want to add `b` to each row of `a`. 

Let's try to use addition:

In [77]:
a + b

[[1, 2, 3], [4, 5, 6], [7, 8, 9], 10, 20, 30]

That is not what we wanted at all. Python has concatenated the two lists, and we have ended up with a list of 6 elements, where the first 3 elements are lists and the last 3 are the individual items from `b`.

So how would we do what we wanted?

We could simply use a `for` loop:

In [78]:
new_matrix = []

for row in a:
    new_row = []
    for element, addition in zip(row, b):
        new_row.append(element + addition)
    new_matrix.append(new_row)

In [79]:
new_matrix

[[11, 22, 33], [14, 25, 36], [17, 28, 39]]

However, we had to write an entire loop, which is error prone and takes a lot of time, etc. 

In Numpy this is much easier. Let's first create Numpy arrays instead of lists:

In [80]:
a = np.array(a)
b = np.array(b)

print(a)
print()
print(b)

[[1 2 3]
 [4 5 6]
 [7 8 9]]

[10 20 30]


And then add them:

In [81]:
a + b

array([[11, 22, 33],
       [14, 25, 36],
       [17, 28, 39]])
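
In the example above, `b` is matched against the last axis of `a`, so `b[j]` is added to column `j`. If you instead want to add one value per row, you can reshape `b` into a column. This is a minimal sketch of that idea:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
b = np.array([10, 20, 30])

# b has shape (3,), so b[j] is added to every element of column j
per_column = a + b

# Reshaped to (3, 1), b becomes a column, so b[i] is added to every element of row i
per_row = a + b.reshape(3, 1)

print(per_row)
```

The rule of thumb is that shapes are compared from the last axis backwards, and an axis of length 1 is stretched to match the other array.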

Let's now say we had some values in a list, and you wanted to multiply each value with a set of weights. 

Let's declare these as Python lists:

In [82]:
a = [10, 20, 30]

In [83]:
weights = [1.1, 2.2, 3.3]

In [84]:
a * weights

TypeError: can't multiply sequence by non-int of type 'list'

In plain Python we get an error, because lists do not support element-wise multiplication.

Again, we would have to write a loop:

In [85]:
weight_sums = []

for i in range(3):
    t = a[i] * weights[i]
    weight_sums.append(t)    

Which produces:

In [86]:
weight_sums

[11.0, 44.0, 99.0]

This is what we wanted, but again it required writing this loop, etc.

If we now look at how we could do this with Numpy:

In [87]:
# Create NumPy arrays for the values and the weights, then multiply them element by element
a = np.array([10, 20, 30])
weights = np.array([1.1, 2.2, 3.3])
a * weights 

array([11., 44., 99.])

It is also worth mentioning that vectorised NumPy operations such as these can be more than 100x faster than using `for` loops in Python, particularly on large arrays. The underlying NumPy code is written in C and is very optimised, so the loop over the elements runs in compiled code rather than in the Python interpreter.

If you are writing any code that uses loops in Python, and find that it is taking a very long time, see if you can perform the same thing using NumPy's array operations.
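
You can measure the difference yourself with the standard library's `timeit` module. The sketch below is only indicative: the array size and number of repetitions are arbitrary choices, and the actual speed-up depends on your machine and on the operation:

```python
import timeit
import numpy as np

values = list(range(100_000))
array = np.arange(100_000)

def python_loop():
    # Doubles every value one at a time in the Python interpreter
    return [v * 2 for v in values]

def numpy_vectorised():
    # Doubles every value in a single NumPy operation
    return array * 2

# Both approaches compute the same result
assert python_loop() == numpy_vectorised().tolist()

loop_time = timeit.timeit(python_loop, number=20)
numpy_time = timeit.timeit(numpy_vectorised, number=20)

print(f"Loop: {loop_time:.4f}s, NumPy: {numpy_time:.4f}s")
```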

### Matrix Multiplications

We can briefly touch on this, however it does not appear in the rest of the seminars so we will only mention it. 

Matrix multiplications are of the form:

![Matrix](./img/matmul.png)

*Image source*: James, Witten, Hastie, Tibshirani. **An Introduction to Statistical Learning**.

The `*` operator multiplies arrays element by element, so for a true matrix multiplication you need to use the `matmul()` function.

Let's declare two arrays and take a look at them:

In [88]:
A = np.array([[1,2], [3,4]])
B = np.array([[5,6], [7,8]])

print("A =\n", A)
print("B =\n", B)

A =
 [[1 2]
 [3 4]]
B =
 [[5 6]
 [7 8]]


Using `matmul()` we can perform this operation quickly:

In [89]:
np.matmul(A, B)

array([[19, 22],
       [43, 50]])

You can also use the `@` operator for this:

In [90]:
A @ B

array([[19, 22],
       [43, 50]])
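
A common source of confusion is the difference between `*` and `@`. The sketch below uses the same `A` and `B` as above to contrast the two:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# Element-wise product: multiplies matching entries
elementwise = A * B

# Matrix product: rows of A combined with columns of B
matrix_product = A @ B

print(elementwise)
print(matrix_product)
```

For example, the top-left entry of the matrix product is `1*5 + 2*7`, whereas the top-left entry of the element-wise product is simply `1*5`.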

### Array comparison

You can also compare arrays quickly, which can come in very handy for when you are comparing the output of a trained model with the labels, for example.

In [91]:
a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[2, 2, 3], [1, 2, 5]])

In [92]:
a == b

array([[False,  True,  True],
       [False, False,  True]])

And the same is true for `>`, `<`, and `!=`, for example:

In [93]:
a > b

array([[False, False, False],
       [ True,  True, False]])
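
To connect this to the model-evaluation use case mentioned above, here is a minimal sketch that computes an accuracy from a comparison. The `labels` and `predictions` arrays are invented for the example:

```python
import numpy as np

labels = np.array([0, 1, 1, 0, 2, 2])
predictions = np.array([0, 1, 0, 0, 2, 1])

# Boolean array, True wherever the prediction matches the label
correct = predictions == labels

# True counts as 1 and False as 0, so sum() counts matches and mean() gives accuracy
n_correct = correct.sum()
accuracy = correct.mean()

print(f"{n_correct} of {labels.size} correct, accuracy: {accuracy:.2%}")
```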

## Array Slicing

One of NumPy's biggest strengths is the ability to slice arrays in to subsets very easily.

Let's start with some simple slicing of a 1D array:

In [94]:
a = np.arange(1, 11)
a

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [95]:
a[0]

1

In [96]:
a[-1]

10

We have seen this already, but you can also provide a range of items you wish to retrieve:

In [97]:
a[0:3]

array([1, 2, 3])

As is Python convention, `0:3` **includes** element 0 and **excludes** element 3.

You can omit the digit before or after the `:` to mean 'from the start' or 'to the end'.

In [98]:
a[:5]

array([1, 2, 3, 4, 5])

Ranges can also use negative numbers, which we saw earlier for selecting the last element.

Here we go from the 5th last element up to, but not including, the last element:

In [99]:
a[-5:-1]

array([6, 7, 8, 9])

You can also select items in steps. You place the step size after the range, in the form `start:stop:step`. 

Let's make a larger array for this, and select only every 10th element:

In [100]:
a = np.arange(100)

In [101]:
a[0:100:10]

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

Or, using no values for start and end, we can select every 5th element:

In [102]:
a[::5]

array([ 0,  5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80,
       85, 90, 95])
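
Two details about slices are easy to miss: a negative step walks backwards through the array, and a slice is a *view* onto the original data rather than a copy. The sketch below uses a separate array `arr` so as not to disturb `a`:

```python
import numpy as np

arr = np.arange(10)

# A negative step walks backwards through the array
reversed_arr = arr[::-1]

# A slice is a view onto the same data, not a copy
first_three = arr[:3]
first_three[0] = 99
print(arr[0])  # the original array has changed too

# Use .copy() when you need independent data
independent = arr[:3].copy()
independent[1] = -1
print(arr[1])  # the original array is unchanged this time
```

This differs from Python lists, where slicing always creates a new list.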

### 2D Array Slicing

So far so good with 1D arrays. However, we can also do more advanced slicing on 2D arrays.

Here are a few examples we will cover now:

![Slicing](./img/array-slicing.jpg)

Let's first replicate the dataset from the image above:

In [103]:
a = np.arange(36).reshape(6,6)
a

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

We use the function `reshape()` to create a 2D matrix out of the 1D array that `arange()` returns.

So, with 2D arrays you specify each slice for each dimension using `,`.

The first dimension specifies the rows you want to select, the second dimension specifies the columns you want to select.

So, we can select the first row like this:

In [104]:
a[0,:]

array([0, 1, 2, 3, 4, 5])

Note that we used `:` to specify the entire range, as we have seen before. 

So we said, we wanted row 0, all columns.

If we wanted only the first **column**, then we can use the following:

In [105]:
a[:, 0]

array([ 0,  6, 12, 18, 24, 30])

In this case, we use `:` to say all rows, and `0` to say the first column.

We can also use negative indexing.  

For example we want only the **last two rows**, and only the **first column**:

In [106]:
a[-2:, 0]

array([24, 30])

Take a look at the image above to confirm what we have selected.

Ranges can be used for both axes/dimensions. So let's say we wanted only the two centre rows and the four centre columns, we can use ranges for both axes:

In [107]:
a[2:4, 1:5]

array([[13, 14, 15, 16],
       [19, 20, 21, 22]])

### 3D Slicing

Of course, 3D slicing is also possible.

We will not demonstrate this very much, but bear in mind that a 3D array in NumPy is like a list of 2D arrays.

When indexing higher-dimensional arrays, the first thing to note is the **order of the axes**. For a 3D array built as a list of 2D arrays, axis 0 selects the 2D array (the depth), axis 1 selects the row, and axis 2 selects the column. This might be different to what you have seen before: in MATLAB, for example, the third dimension comes last, so the order is row, column, page.

![numpy](./img/numpy-matrices-crop.png)

Note that this follows from how NumPy lays out data in memory by default: the last axis is stored contiguously (row-major, or C, order), so each 2D array occupies one contiguous block in the computer's memory, which makes it fast to access.

First let's make a data structure:

In [108]:
a = np.array([
    np.arange(36).reshape(6, 6),
    np.arange(36, 36+36).reshape(6, 6),
    np.arange(72, 72+36).reshape(6, 6),
])
a

array([[[  0,   1,   2,   3,   4,   5],
        [  6,   7,   8,   9,  10,  11],
        [ 12,  13,  14,  15,  16,  17],
        [ 18,  19,  20,  21,  22,  23],
        [ 24,  25,  26,  27,  28,  29],
        [ 30,  31,  32,  33,  34,  35]],

       [[ 36,  37,  38,  39,  40,  41],
        [ 42,  43,  44,  45,  46,  47],
        [ 48,  49,  50,  51,  52,  53],
        [ 54,  55,  56,  57,  58,  59],
        [ 60,  61,  62,  63,  64,  65],
        [ 66,  67,  68,  69,  70,  71]],

       [[ 72,  73,  74,  75,  76,  77],
        [ 78,  79,  80,  81,  82,  83],
        [ 84,  85,  86,  87,  88,  89],
        [ 90,  91,  92,  93,  94,  95],
        [ 96,  97,  98,  99, 100, 101],
        [102, 103, 104, 105, 106, 107]]])

We can see this axis order if we look at `a`'s shape:

In [109]:
np.shape(a)

(3, 6, 6)

Say we want the deepest of the arrays. For this we index the first axis, here with `-1`:

In [110]:
a[-1,:,:]

array([[ 72,  73,  74,  75,  76,  77],
       [ 78,  79,  80,  81,  82,  83],
       [ 84,  85,  86,  87,  88,  89],
       [ 90,  91,  92,  93,  94,  95],
       [ 96,  97,  98,  99, 100, 101],
       [102, 103, 104, 105, 106, 107]])

Likewise, we can say we want all the first rows:

In [111]:
a[:,0,:]

array([[ 0,  1,  2,  3,  4,  5],
       [36, 37, 38, 39, 40, 41],
       [72, 73, 74, 75, 76, 77]])

Or all the first columns:

In [112]:
a[:, :, 0]

array([[  0,   6,  12,  18,  24,  30],
       [ 36,  42,  48,  54,  60,  66],
       [ 72,  78,  84,  90,  96, 102]])
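
To make the axis order concrete, the sketch below picks out a single element and checks the shapes of the slices used above. It builds an equivalent array with a single `reshape()` call and names it `cube` to keep it separate from `a`:

```python
import numpy as np

# Same values as the three stacked 6x6 blocks above
cube = np.arange(108).reshape(3, 6, 6)

# Index order is (block, row, column)
element = cube[1, 2, 3]
print(element)

# Indexing an axis with a single number removes that axis from the result
print(cube[-1, :, :].shape)
print(cube[:, 0, :].shape)
print(cube[:, :, 0].shape)
```

The element at block 1, row 2, column 3 is `36 + 2*6 + 3`, since each block holds 36 values and each row holds 6.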

## Searching in NumPy

You will often want to search for particular elements based on conditions in NumPy, and this can be done by placing a condition within `[]`, in much the same way as we have used index slicing.

In [113]:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

If you want all values greater than a certain value, use the following syntax, in the same way that you use indexing:

In [114]:
a[a > 5]

array([6, 7, 8, 9])

Note that it returns an array of **values**, and loses its shape. The array `a` was a 2D array, and we got back a 1D array of values. This is often what you want, but not always.

If you prefer that the indices are returned instead of the values, use the `where` function. The indices tell you where in the original array each match is located, so the position information is not lost:

In [115]:
np.where(a>2)

(array([0, 1, 1, 1, 2, 2, 2]), array([2, 0, 1, 2, 0, 1, 2]))

In this case we are returned a tuple of index arrays, one for each dimension: the first holds the row indices and the second holds the column indices.

So for example, at index 0,2 the element is greater than 2.

We can see this visually or check this:

In [116]:
a[0,2]

3
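
Two related tools are worth a brief sketch: `np.argwhere()`, which returns the matching indices as one (row, column) pair per element, and the three-argument form of `np.where()`, which does keep the shape of the input:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# argwhere returns one (row, column) pair per matching element
pairs = np.argwhere(a > 7)
print(pairs)

# With three arguments, where() picks from a where the condition holds, else 0
clipped = np.where(a > 5, a, 0)
print(clipped)
```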

Numpy allows you to perform sophisticated searching, far more than we have covered here, but we will leave it at that for now.

## Loading and Reading Data

I will just mention a few helper functions here to load data. The two most common functions for loading text data from disk are:

- `np.genfromtxt()`
- `np.loadtxt()`

We will not demonstrate this now, as we will use this functionality later anyway to load in data and so on.
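
For reference, here is a minimal sketch of both functions. To keep it self-contained, it reads from in-memory text using `io.StringIO` instead of real files; the contents are made up for the example:

```python
import io
import numpy as np

# loadtxt expects clean, complete data (whitespace-separated by default)
clean_text = io.StringIO("1 2 3\n4 5 6\n")
clean = np.loadtxt(clean_text)
print(clean)

# genfromtxt tolerates missing values and fills them with nan by default
csv_text = io.StringIO("1.0,2.0,3.0\n4.0,,6.0\n")
data = np.genfromtxt(csv_text, delimiter=",")
print(data)
```

With a real file you would pass the file path as the first argument instead of the `StringIO` object.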

### Summary

So these examples cover the most important Numpy functionality that we use for this course. Numpy is a huge framework, and you could spend the entire course covering only Numpy.

During the course of the seminar we will also cover other functionality, and anything new will be highlighted as we go along. 

---

# Pandas

As mentioned, I am aware that Pandas was covered but if a refresher is required we can cover it now before the afternoon session.

Before starting any of our Machine Learning material, we will cover the two most commonly used packages for handling and manipulating data, namely Numpy and Pandas.

You can think of Numpy as an array and matrix manipulation library (if you are familiar with MATLAB, it provides much of the functionality of MATLAB). 

Pandas, on the other hand, provides the functionality of R's data frames, or you could think of it as Excel for Python. It allows you to manipulate data stored in Excel-like spreadsheets, in a programmatic way. Pandas is best suited for this type of **tabular** data.

Therefore, unlike NumPy arrays, Pandas DataFrames have column names, and you can access individual columns using the column name, while in NumPy you access columns using indices.

NumPy is also used for multi-dimensional data or $n$-dimensional arrays, as we saw previously, while Pandas is for tabular data.

Also, Pandas DataFrames are convenient if you get your data from doctors here at the LKH or MUG, as you will likely receive your data as Excel files, and Pandas provides convenient ways of importing this data using functions such as `read_excel()`. You can also read SAS and SPSS files, and SQL databases.

In [117]:
import pandas as pd  # Convention 

# See https://www.dataquest.io/blog/excel-and-pandas/
movies = pd.read_excel('./data/movies.xls')

In [118]:
movies

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,Intolerance: Love's Struggle Throughout the Ages,1916,Drama|History|War,,USA,Not Rated,123,1.33,385907.0,,...,436,22,9.0,481,691,1,10718,88,69.0,8.0
1,Over the Hill to the Poorhouse,1920,Crime|Drama,,USA,,110,1.33,100000.0,3000000.0,...,2,2,0.0,4,0,1,5,1,1.0,4.8
2,The Big Parade,1925,Drama|Romance|War,,USA,Not Rated,151,1.33,245000.0,,...,81,12,6.0,108,226,0,4849,45,48.0,8.3
3,Metropolis,1927,Drama|Sci-Fi,German,Germany,Not Rated,145,1.33,6000000.0,26435.0,...,136,23,18.0,203,12000,1,111841,413,260.0,8.3
4,Pandora's Box,1929,Crime|Drama|Romance,German,Germany,Not Rated,110,1.33,,9950.0,...,426,20,3.0,455,926,1,7431,84,71.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1333,Twin Falls Idaho,1999,Drama,English,USA,R,111,1.85,500000.0,985341.0,...,980,505,482.0,3166,180,0,3479,87,54.0,7.3
1334,Universal Soldier: The Return,1999,Action|Sci-Fi,English,USA,R,83,1.85,24000000.0,10431220.0,...,2000,577,485.0,4024,401,0,24216,162,75.0,4.1
1335,Varsity Blues,1999,Comedy|Drama|Romance|Sport,English,USA,R,106,1.85,16000000.0,52885587.0,...,23000,255,35.0,23369,0,0,35312,267,67.0,6.4
1336,Wild Wild West,1999,Action|Comedy|Sci-Fi|Western,English,USA,PG-13,106,1.85,170000000.0,113745408.0,...,10000,4000,582.0,15870,0,2,129601,648,85.0,4.8


If we were to look at the Excel file, we would see it has in fact got multiple sheets. 

By default Pandas will open the first sheet, 0. Sheets are 0-indexed. You can specify the sheet you wish to import using `sheet_name`, as follows:

In [119]:
movies_2000s = pd.read_excel('./data/movies.xls', sheet_name=1)

If you know the name of the sheet, this is equivalent:

In [120]:
movies_2000s = pd.read_excel('./data/movies.xls', sheet_name='2000s')

In [121]:
movies_2000s.head()

Unnamed: 0,Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
0,102 Dalmatians,2000,Adventure|Comedy|Family,English,USA,G,100.0,1.85,85000000.0,66941559.0,...,2000.0,795.0,439.0,4182,372,1,26413,77.0,84.0,4.8
1,28 Days,2000,Comedy|Drama,English,USA,PG-13,103.0,1.37,43000000.0,37035515.0,...,12000.0,10000.0,664.0,23864,0,1,34597,194.0,116.0,6.0
2,3 Strikes,2000,Comedy,English,USA,R,82.0,1.85,6000000.0,9821335.0,...,939.0,706.0,585.0,3354,118,1,1415,10.0,22.0,4.0
3,Aberdeen,2000,Drama,English,UK,,106.0,1.85,6500000.0,64148.0,...,844.0,2.0,0.0,846,260,0,2601,35.0,28.0,7.3
4,All the Pretty Horses,2000,Drama|Romance|Western,English,USA,PG-13,220.0,2.35,57000000.0,15527125.0,...,13000.0,861.0,820.0,15006,652,2,11388,183.0,85.0,5.8


The first column contains an automatically created index. If we wanted, we could use the film title as the index instead:

In [122]:
movies_2000s = pd.read_excel('./data/movies.xls', sheet_name=1, index_col=0)
movies_2000s.head()

Title,Year,Genres,Language,Country,Content Rating,Duration,Aspect Ratio,Budget,Gross Earnings,Director,...,Facebook Likes - Actor 1,Facebook Likes - Actor 2,Facebook Likes - Actor 3,Facebook Likes - cast Total,Facebook likes - Movie,Facenumber in posters,User Votes,Reviews by Users,Reviews by Crtiics,IMDB Score
102 Dalmatians,2000,Adventure|Comedy|Family,English,USA,G,100.0,1.85,85000000.0,66941559.0,Kevin Lima,...,2000.0,795.0,439.0,4182,372,1,26413,77.0,84.0,4.8
28 Days,2000,Comedy|Drama,English,USA,PG-13,103.0,1.37,43000000.0,37035515.0,Betty Thomas,...,12000.0,10000.0,664.0,23864,0,1,34597,194.0,116.0,6.0
3 Strikes,2000,Comedy,English,USA,R,82.0,1.85,6000000.0,9821335.0,DJ Pooh,...,939.0,706.0,585.0,3354,118,1,1415,10.0,22.0,4.0
Aberdeen,2000,Drama,English,UK,,106.0,1.85,6500000.0,64148.0,Hans Petter Moland,...,844.0,2.0,0.0,846,260,0,2601,35.0,28.0,7.3
All the Pretty Horses,2000,Drama|Romance|Western,English,USA,PG-13,220.0,2.35,57000000.0,15527125.0,Billy Bob Thornton,...,13000.0,861.0,820.0,15006,652,2,11388,183.0,85.0,5.8


### Iris

Let's move to another dataset, namely the Iris dataset:

In [123]:
iris_df = pd.read_csv('data/iris.csv', sep=',', header=None)

Jupyter formats your data nicely if you just execute the variable name:

In [124]:
iris_df

Unnamed: 0,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


You can use `print()` if you prefer to see the data in its unformatted state:

In [125]:
print(iris_df)

       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]


The `head()` and `tail()` functions are useful for previewing your data.
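
As a quick sketch of how they behave, the example below uses a small made-up DataFrame named `df` rather than the Iris data:

```python
import pandas as pd

# A small made-up DataFrame, used only for illustration
df = pd.DataFrame({
    "id": list(range(100)),
    "value": [i * 0.5 for i in range(100)],
})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)
```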

We mentioned previously that Pandas is similar to Excel or R's data frames. Here we will add column names to the dataset:

In [126]:
iris_df.columns = ["sepal_length", 'sepal_width', 'petal_length', 'petal_width', 'class']

In [127]:
iris_df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


You can use the `describe()` function to get a quick overview of some of the statistics of the data, such as the mean value per column:

In [128]:
iris_df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


Note that the class column is not included as it does not contain numerical data.

### Slicing Data

You can select rows and columns based on parameters using Pandas easily. For example, if you only wanted the `sepal_length` column:

In [129]:
iris_df['sepal_length']

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

If your column name does not contain any spaces or special characters, you can use the dot (`.`) notation to select columns:

In [130]:
iris_df.sepal_length

0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: sepal_length, Length: 150, dtype: float64

You can also select multiple columns by passing a list:

In [131]:
iris_df[['petal_length', 'class']]

Unnamed: 0,petal_length,class
0,1.4,Iris-setosa
1,1.4,Iris-setosa
2,1.3,Iris-setosa
3,1.5,Iris-setosa
4,1.4,Iris-setosa
...,...,...
145,5.2,Iris-virginica
146,5.0,Iris-virginica
147,5.2,Iris-virginica
148,5.4,Iris-virginica


If you know the index label of a row (here the index is a zero-based integer for each row), then use `loc`:

In [132]:
iris_df.loc[0]

sepal_length            5.1
sepal_width             3.5
petal_length            1.4
petal_width             0.2
class           Iris-setosa
Name: 0, dtype: object

You can select multiple rows:

In [133]:
iris_df.loc[[1,2,3]]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa


Or a range:

In [134]:
iris_df.loc[0:10]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


And combine this with a list of column names, as follows:

In [135]:
iris_df.loc[0:10, ['petal_length', 'class']]

Unnamed: 0,petal_length,class
0,1.4,Iris-setosa
1,1.4,Iris-setosa
2,1.3,Iris-setosa
3,1.5,Iris-setosa
4,1.4,Iris-setosa
5,1.7,Iris-setosa
6,1.4,Iris-setosa
7,1.5,Iris-setosa
8,1.4,Iris-setosa
9,1.5,Iris-setosa


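To select several non-contiguous ranges by position, build the positions with NumPy's `np.r_` and pass them to `iloc`. Note that `iloc` works on integer positions and, like ordinary Python slices, excludes the end of each range:

In [136]: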
import numpy as np
iris_df.iloc[np.r_[2:5, 7:10]]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa
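
The difference between `loc` and `iloc` is hidden when the index is the default 0, 1, 2, ... numbering. As a minimal sketch, the toy DataFrame below (hypothetical data with an index that starts at 10) shows that `loc` selects by label and includes the end label, while `iloc` selects by position and excludes the end position.

```python
import pandas as pd

# Hypothetical toy data with a non-default index
demo_df = pd.DataFrame(
    {"petal_length": [1.4, 1.3, 1.5, 1.7, 1.4]},
    index=[10, 11, 12, 13, 14],
)

# loc uses index labels and includes the end label
by_label = demo_df.loc[10:12]

# iloc uses integer positions and excludes the end position
by_position = demo_df.iloc[0:2]

print(list(by_label.index))     # [10, 11, 12]
print(list(by_position.index))  # [10, 11]
```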


A common task is to filter your dataset based on conditions. For example, you may want all rows where the sepal length is greater than 7 cm.

If you run the following:

In [137]:
condition = iris_df.sepal_length > 7

you will get back a boolean Series (a mask) with one True/False value per row:

In [138]:
condition

0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Name: sepal_length, Length: 150, dtype: bool

Indexing the DataFrame with this mask keeps only the rows where it is `True`:

In [139]:
iris_df[condition]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
102,7.1,3.0,5.9,2.1,Iris-virginica
105,7.6,3.0,6.6,2.1,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
125,7.2,3.2,6.0,1.8,Iris-virginica
129,7.2,3.0,5.8,1.6,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
131,7.9,3.8,6.4,2.0,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica


With `loc`, you can combine the mask with a list of column names:

In [140]:
iris_df.loc[condition, ['sepal_length', 'class']]

Unnamed: 0,sepal_length,class
102,7.1,Iris-virginica
105,7.6,Iris-virginica
107,7.3,Iris-virginica
109,7.2,Iris-virginica
117,7.7,Iris-virginica
118,7.7,Iris-virginica
122,7.7,Iris-virginica
125,7.2,Iris-virginica
129,7.2,Iris-virginica
130,7.4,Iris-virginica
131,7.9,Iris-virginica
135,7.7,Iris-virginica
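
Because a mask is just a Series of booleans, you can also summarise it directly. The sketch below uses a few made-up values (`lengths` is a hypothetical Series, not the iris column) to count how many rows match a condition and what fraction of the data that is.

```python
import pandas as pd

# Hypothetical toy values, used only for illustration
lengths = pd.Series([5.1, 7.1, 7.6, 6.3], name="sepal_length")
mask = lengths > 7

n_matches = int(mask.sum())   # True counts as 1, False as 0
share = float(mask.mean())    # fraction of rows that match

print(n_matches, share)  # 2 0.5
```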


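You can combine several conditions with `&` (and). Wrap each condition in parentheses, because `&` binds more tightly than comparison operators such as `>`: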

In [141]:
condition_multiple = (iris_df['sepal_length'] > 5) & (iris_df['petal_length'] > 6)

In [142]:
iris_df[condition_multiple]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
105,7.6,3.0,6.6,2.1,Iris-virginica
107,7.3,2.9,6.3,1.8,Iris-virginica
109,7.2,3.6,6.1,2.5,Iris-virginica
117,7.7,3.8,6.7,2.2,Iris-virginica
118,7.7,2.6,6.9,2.3,Iris-virginica
122,7.7,2.8,6.7,2.0,Iris-virginica
130,7.4,2.8,6.1,1.9,Iris-virginica
131,7.9,3.8,6.4,2.0,Iris-virginica
135,7.7,3.0,6.1,2.3,Iris-virginica


In [143]:
iris_df.loc[condition_multiple, ['sepal_length', 'petal_length', 'class']]

Unnamed: 0,sepal_length,petal_length,class
105,7.6,6.6,Iris-virginica
107,7.3,6.3,Iris-virginica
109,7.2,6.1,Iris-virginica
117,7.7,6.7,Iris-virginica
118,7.7,6.9,Iris-virginica
122,7.7,6.7,Iris-virginica
130,7.4,6.1,Iris-virginica
131,7.9,6.4,Iris-virginica
135,7.7,6.1,Iris-virginica
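
Conditions can be combined in other ways too. As a minimal sketch on a toy DataFrame (the values in `demo_df` are made up for illustration), `|` means "or", `~` negates a mask, and `isin` tests membership in a list of values.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
demo_df = pd.DataFrame({
    "sepal_length": [5.1, 7.6, 6.3, 7.9],
    "petal_length": [1.4, 6.6, 5.0, 6.4],
    "class": ["Iris-setosa", "Iris-virginica",
              "Iris-versicolor", "Iris-virginica"],
})

long_sepal = demo_df["sepal_length"] > 7
short_petal = demo_df["petal_length"] < 2

# Rows matching either condition
either = demo_df[long_sepal | short_petal]

# Rows matching neither condition (~ negates the mask)
neither = demo_df[~(long_sepal | short_petal)]

# Rows whose class is one of the listed values
chosen = demo_df[demo_df["class"].isin(["Iris-setosa", "Iris-versicolor"])]

print(list(either.index))   # [0, 1, 3]
print(list(neither.index))  # [2]
print(list(chosen.index))   # [0, 2]
```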


--- 

# End of Morning Session

We will continue after a break.