# Day 2

© 2026, Marcus D. Bloice, licensed under <a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA 4.0</a><img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;">

## Topics

Today we will cover the following topics:

- The Pandas Library
- The NumPy Library
- SQL Databases
- Plotting
- Machine Learning
- Assignment

For data analysis, scientific programming, and machine learning the essential tools that you'll use are Pandas and NumPy. For plotting the most commonly used package is matplotlib, which we will cover also.

---

# Data Science Basics: Pandas and NumPy

Now we will cover the basics of Data Science, as this is likely an area that is of interest to many of you attending this course. 

We will cover a few distinct packages in order to learn the basics of Data Science, namely:

- Pandas
- NumPy
- MatPlotLib
- SciKit-Learn
- SQLite

## Overview

Let's first discuss each of these packages. 

### Pandas

Pandas (≈ Python Data Analysis) is a data manipulation framework for Python. You can think of it as Excel for Python, or as Python's counterpart to R's data frames.

Pandas makes it easy to create and manipulate tabular data, and can read many file formats including Excel and CSV files, but also SPSS files and other formats.

### NumPy

NumPy is another Data Science toolkit that is used frequently. 

You can think of NumPy as a lower-level toolkit for manipulating numerical and tabular data; it is more like MATLAB for Python.

### MatPlotLib

The most popular plotting library for Python is MatPlotLib, and we will cover the basics of it in this course. Plotting is more or less essential for exploratory data analysis. 

### SQLite and Databases

If you are working with very large data, split across multiple tables and relationships, you may well encounter data in database form.

SQLite is one such database software that is used frequently.

### SciKit-Learn

SciKit-Learn is a Python package for Machine Learning. 

We will not cover much Machine Learning in this course, except for some basics. If you want to learn more about Machine Learning and Deep Learning, then take a look at our Advanced Topics in Scientific Programming course; more details can be found on my homepage: <https://user.medunigraz.at/marcus.bloice/>

However, we will quickly cover some very basic algorithms such as Logistic Regression, so that you can see how SciKit-Learn works with data from packages such as NumPy or Pandas.

# Pandas

As mentioned previously, Pandas is a library for tabular data. It provides the ability to analyse spreadsheet-like data in tabular form, not unlike Excel, or R's Data Frames.

Python Pandas is used for data manipulation and analysis. Core purposes:

- Structured data handling: Work with tabular and labeled data using DataFrame and Series.
- Data loading and saving: Read/write CSV, Excel, JSON, SQL databases, Parquet, etc.
- Data cleaning: Handle missing values, filter rows, select columns, convert data types, rename, reorder.
- Data transformation: Grouping, aggregation, reshaping (pivot, melt), joining/merging datasets, sorting.
- Exploratory data analysis: Summary statistics, distributions, correlations, quick inspection of datasets.
- Time series analysis: Date/time indexing, resampling, rolling windows.
- Feeds directly into NumPy, matplotlib, and scikit-learn workflows.

In short, Pandas is the standard tool in Python for turning raw data into clean, analysable structures suitable for statistics, visualisation, and machine learning.

In a machine learning or data analysis workflow, you will often load, clean, and explore your data with Pandas first, and then use NumPy for the final stages of the work. We will look at NumPy later.

To import Pandas, we will do as follows:

In [None]:
import pandas as pd

Here we use the `pd` convention. You will see this online all the time, and it has become the standard convention when importing Pandas, as it saves us some keystrokes.

## Introduction

Before we get to importing some data, let's create a small dataset so that we can test some of the functionality that is built in to Pandas.

Let's start with the absolute basics: Pandas data structures are based on Series and Data Frames. 

- Series: 1-dimensional, like a list 
- Data Frame: 2-dimensional, tabular style data, like a spreadsheet

We will concentrate almost entirely on 2D Data Frames, however first we will look at how to create a 1D Series:

In [None]:
series = pd.Series([10, 20, 30])

In [None]:
series

In [None]:
type(series)

Unlike a standard Python list, a series can have an index, which can be non-numeric. 

In [None]:
series = pd.Series([10, 20, 30], index=['Value 1', 'Value 2', 'Value 3'])
series

However, mostly you will be working with 2D, tabular data. For this we use a Data Frame. 

To do this, you create a new data set using the `DataFrame()` function. You can do this by passing some data to it and you will get a new Data Frame out.

In [None]:
patients = pd.DataFrame({
    'age': [25, 32, 41],
    'height': [175, 180, 168]
})

We create a 2D data frame **by column** - each column is defined using a list. Notice how we are also able to name these columns:

In [None]:
patients

You can see that the Data Frame has 2 columns, named `age` and `height`, and 3 rows, which have automatically been given a numerical index.

A numerical, 0-based index is often exactly what you want, however we may want to have a custom index. We do this as follows:

In [None]:
patients = pd.DataFrame({
    'age': [25, 32, 41],
    'height': [175, 180, 168]}, 
    index = ['patient_id_1', 'patient_id_2', 'patient_id_3'])

patients

Ok let's now create a Data Frame with a few columns, with various types of data.

You may have noticed that we passed our data to the `DataFrame()` function as a dictionary; that is how we were able to name each of the columns.

Therefore, if you have your data as a dictionary, you can just pass this to `DataFrame()` and it will create a Data Frame:

In [None]:
data = {
    "patient_id": ["P001", "P002", "P003", "P004"],
    "age": [45, 62, 51, 38],
    "sex": ["F", "M", "F", "M"],
    "hba1c": [6.1, 7.4, 5.9, 8.2]
}

In [None]:
data

Now, we pass the `data` dictionary to `DataFrame()` and store the Data Frame in a variable called `patients`:

In [None]:
patients = pd.DataFrame(data)

In [None]:
patients

As you can see, Jupyter does a good job previewing our data. However, the dataset above is very small. 

## Inspecting a Data Frame

If you are dealing with very large datasets, you do not necessarily want them to clog up your notebook with hundreds of rows. For that there are a few functions that can help you preview the data and get an overview of it.

A few useful functions can be used, such as:

- `head()` / `tail()`
- `info()`
- `describe()`

Let's try them:

In [None]:
patients.head(2)

In [None]:
patients.tail(2)

`head(n)` shows only the first `n` rows, and `n` defaults to 5. This is useful for previewing large datasets. Here we asked for 2 rows, so only the first two patients are shown. The equivalent function is `tail(n)`, which shows the last `n` rows of the Data Frame.
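To see the default in action, here is a minimal sketch using a made-up Data Frame of 20 rows (the name `demo` and its values are purely illustrative and not part of the course data). Calling `head()` without an argument returns 5 rows, while an explicit `n` overrides this:

```python
import pandas as pd

# Hypothetical data: 20 rows, just to have more rows than the default preview
demo = pd.DataFrame({'value': range(20)})

first_rows = demo.head()    # no argument: the first 5 rows
last_rows = demo.tail(3)    # explicit n: the last 3 rows

print(len(first_rows))  # 5
print(len(last_rows))   # 3
```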

Now try:

In [None]:
patients.info()

Here we see various details about each column, such as its type. We see that `age` is of type `int64`, while `hba1c` is of type `float64`. These types have been automatically inferred by Pandas when we first created the Data Frame.

These types can of course be user-defined, which we won't cover.

Last, we can try:

In [None]:
patients.describe()

Here we see some statistics for each of the numerical columns. Notice that, for example, the `patient_id` column doesn't appear here, as this column is not numeric and values such as the standard deviation would make no sense. Only suitable columns are shown when `describe()` is used.

To examine the number of rows and columns, there are a few properties we can examine:

In [None]:
print(patients.shape)
print(patients.columns)
print(patients.index)

## Saving Data

Later, we will see how we can open CSV files, Excel files, and so on. 

For now, let's see how you can save your data. 

In [None]:
patients.to_csv("out.csv", index=False)
patients.to_excel("out.xlsx", index=False)

We can preview the CSV file directly in Jupyter. To view the Excel file, we need to download it first. Note that writing Excel files requires an additional package, such as `openpyxl`, to be installed.
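As a small illustration of what `to_csv()` writes, the sketch below saves a made-up Data Frame and reads it straight back with `read_csv()`. To avoid creating files, it writes into an in-memory buffer (`io.StringIO`) instead of a file name; the `demo` data is hypothetical:

```python
import io
import pandas as pd

# Hypothetical two-patient dataset
demo = pd.DataFrame({
    'patient_id': ['P001', 'P002'],
    'age': [45, 62],
})

# Write the CSV into an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # index=False: do not write the row index as a column

# Rewind the buffer and read the CSV back into a new Data Frame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```

Passing a file name such as `"out.csv"` instead of the buffer works in the same way.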

## Column Access

Ok so we have our data in the Data Frame called `patients`, how do we access columns, rows, etc?

Just as in lists, we use square brackets (`[`, `]`) for this.

For example, let's say we wanted the `age` column and nothing else:

In [None]:
patients['age']

This selects the `age` column, which can then be used further. **Note** that single columns are in fact Series.

By accessing a single column, we can perform operations on the columns, such as getting the mean:

In [None]:
from statistics import mean

mean(patients['age'])

Multiple columns can also be accessed:

In [None]:
patients[['age', 'hba1c']]

This time we have passed a list of column names, `['age', 'hba1c']` as an index.

Another way to access individual columns is with the `.` notation:

In [None]:
patients['hba1c']

In [None]:
patients.hba1c

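**Note**: for this to work, the column name must be a valid Python name, so it cannot contain a space, and it must not clash with an existing Data Frame attribute or method. If you want to use this shorthand notation, ensure you name your columns accordingly!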
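The sketch below illustrates these limits of the `.` notation. The column names `blood pressure` and `count` are hypothetical, chosen only because one contains a space and the other shares its name with a Data Frame method:

```python
import pandas as pd

demo = pd.DataFrame({
    'age': [45, 62],
    'blood pressure': [120, 135],   # hypothetical column with a space in its name
    'count': [1, 2],                # hypothetical column named like a method
})

print(demo.age.tolist())                 # dot notation works for 'age'
print(demo['blood pressure'].tolist())   # a name with a space needs square brackets

# demo.count refers to the count() method, not to our column
print(callable(demo.count))              # True
print(demo['count'].tolist())            # square brackets always give the column
```

Square brackets work for every column name, which is why they are the safer choice in scripts.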

## Creating Columns

Columns can be added using assignment:

In [None]:
patients['height'] = [167, 175, 165, 190]
patients

Here we have created a `height` column and assigned it its values using a list. 

Let's also make a column for the patients' weight:

In [None]:
patients['weight'] = [85, 70, 75, 90]

You can get creative with this; for example, you can create a new column based on the data from another column or columns:

In [None]:
patients

In [None]:
patients['bmi'] = patients['weight'] / (patients['height'] / 100) ** 2
patients

## Row Access

To access individual rows and columns, use the `loc` indexer:

In [None]:
patients.loc[0]

As you can see, we have received the row labelled `0` back.

If we only want a specific column, we can do that:

In [None]:
patients.loc[0, 'sex']

Or more column names:

In [None]:
patients.loc[0, ['sex', 'hba1c']]

Therefore, `loc` expects you to provide which rows you want, followed by which columns (you will get all back by default).

And we can of course use ranges for our indexing:

In [None]:
patients.loc[0:2]

Notice that `loc` is start and stop **inclusive**.

You can use `:` as shorthand for "all rows":

In [None]:
patients.loc[:, ['age', 'height']]

Notice that we passed the columns we wanted using their names. If you want purely index based access, use `iloc`: 

In [None]:
patients.iloc[:2, :]

Or:

In [None]:
patients.iloc[:, 2:4]

Note that `iloc` is start **inclusive** and stop **exclusive**!

- You can think of `iloc` as being numeric based access **only**. 
- If you want to be able to select rows and columns using their labels, use `loc`.
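The difference between the two indexers is easiest to see with a non-numeric index. The following is a minimal sketch with a made-up Data Frame whose rows are labelled `a` to `d` (these labels are an assumption for illustration only):

```python
import pandas as pd

demo = pd.DataFrame(
    {'age': [45, 62, 51, 38], 'sex': ['F', 'M', 'F', 'M']},
    index=['a', 'b', 'c', 'd'],   # hypothetical string labels
)

by_label = demo.loc['a':'c']      # labels 'a' to 'c', stop label included
by_position = demo.iloc[0:2]      # positions 0 and 1, stop position excluded

print(by_label.index.tolist())     # ['a', 'b', 'c']
print(by_position.index.tolist())  # ['a', 'b']
```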

## Dropping Columns

In [None]:
patients = patients.drop('patient_id', axis=1)
patients

Rows can be dropped by changing the axis to `0`:

- Axis 1: always refers to columns
- Axis 0: always refers to rows
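A short sketch of both cases, using a made-up Data Frame (`demo` is not part of the course data):

```python
import pandas as pd

demo = pd.DataFrame({'age': [45, 62, 51], 'sex': ['F', 'M', 'F']})

without_row = demo.drop(1, axis=0)         # drop the row labelled 1
without_column = demo.drop('sex', axis=1)  # drop the column named 'sex'

print(without_row.index.tolist())       # [0, 2]
print(without_column.columns.tolist())  # ['age']
```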

## Conditional Access

Pandas lets you use logic in order to select rows. 

For example, we want to select only patients with a BMI of over 25:

In [None]:
patients['bmi'] > 25

What we get back is actually a Series of `True`/`False` values, depending on whether the condition was met for each row.

To get the data itself, we can do the following:

In [None]:
patients[patients['bmi'] > 25]

Here, we have accessed the rows of patients based on the condition. 
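It can help to store the condition in a variable before using it. The sketch below does this with made-up BMI values (the `demo` Data Frame and the name `mask` are illustrative assumptions):

```python
import pandas as pd

demo = pd.DataFrame({'bmi': [30.5, 22.9, 27.5, 24.9]})  # made-up BMI values

mask = demo['bmi'] > 25        # Series of True/False values, one per row
selected = demo[mask]          # keep only the rows where the mask is True

print(mask.tolist())           # [True, False, True, False]
print(mask.sum())              # True counts as 1, so this is the number of matches
print(selected)
```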

We can also create a new column based on a condition. We saw above that we can create a new column using assignment, so we can create a new `diabetes` column based on the result of the conditional:

In [None]:
patients['diabetes'] = patients['hba1c'] > 6
patients

### Advanced Indexing with `loc`

You can do even more advanced indexing using `loc[]`, such as:

In [None]:
patients.loc[
    (patients["sex"] == "F") & (patients["hba1c"] > 5.0),  # Here is our index for loc
    ["age", "hba1c", 'diabetes']                           # Here are the columns we want  
]

Using `loc` we first provide the index logic, in this example that we want `sex` to be `F` **and** `hba1c` to be greater than 5.0. Next, we provide the columns we want to select. In the example above we said we wanted only the `age`, `hba1c`, and `diabetes` columns.

We could easily leave out the column index:

In [None]:
patients.loc[
    (patients["sex"] == "F") & (patients["hba1c"] > 5.0),
]

## Handling Missing Data

Pandas has useful tools for finding and handling missing data. Real world datasets that you will work on often have missing data. 

Many data analysis algorithms will actually fail on data with missing values, so missing data almost always needs to be addressed.

Let's add a column with some missing data so we can see its features:

In [None]:
patients

In [None]:
import numpy as np

patients['lab_value'] = [10.1, 11.1, np.nan, 9.9]

In [None]:
patients

Here we used a `np.nan` to create what is known as "not a number": NaN. 

NaN is a special value that corresponds to missing data. It is not written as 0, as this can be interpreted incorrectly. 

When importing data, Pandas will use NaN for values that are missing. 

Now we can search for missing data:

In [None]:
patients.isna()

Here we get a table back of `True`/`False` values, where we see our missing data. This is not much use on its own.

We can get an easier representation to read using `sum()`:

In [None]:
patients.isna().sum()

If we sum this again, we get the total number of missing values in the dataset:

In [None]:
patients.isna().sum().sum()

How do we handle missing data? 

We can do something very rudimentary like saying:

In [None]:
patients.fillna(0)

This uses the function `fillna()` to simply replace every missing value with `0`. Alternatively, `dropna()` removes every row that contains a missing value:

In [None]:
patients.dropna()

In [None]:
patients

In the output of `dropna()`, the row with the missing data has been removed.

While these approaches work, in the sense that algorithms that cannot handle missing data will not fail, they are not very sophisticated.

We can instead replace the missing value with the mean of the column, for example:

In [None]:
patients['lab_value'] = patients['lab_value'].fillna(patients['lab_value'].mean())
patients

Now we have a reasonable value in place of our missing data. Note that mean is not the only option, you could take the median, mode (for categorical data), or even use some kind of regression.
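As a rough comparison of two of these options, the sketch below fills the same gap once with the mean and once with the median. The values are made up, and `50.0` is deliberately included as an outlier to show how the two differ:

```python
import numpy as np
import pandas as pd

# Made-up lab values with one missing entry and one outlier (50.0)
lab = pd.Series([10.1, 11.1, np.nan, 9.9, 50.0])

filled_mean = lab.fillna(lab.mean())       # the outlier pulls the mean upwards
filled_median = lab.fillna(lab.median())   # the median is less affected by it

print(filled_mean.loc[2])     # roughly 20.3
print(filled_median.loc[2])   # roughly 10.6
```

Which replacement is appropriate depends on the data; the median is often preferred when a column contains extreme values.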

## Copies

You may have noticed something about the code snippets above. 

When we ran the code: 

```python
patients.dropna()
```

It dropped the row with the missing value, but if you were to look at `patients` again, the row would still be there!

This is because many functions in Pandas will return a **copy** of your data. This is more or less a safety feature. 

What you need to do with functions that return a copy is something like the following:

```python
patients = patients.dropna()
```

or 

```python
patients_no_nans = patients.dropna()
```

Now you have your original data, as it looked before `dropna()` was called, and a new Data Frame called `patients_no_nans` with the dropped row. 

Just be aware of this, as **many**, if not **most**, Pandas functions return copies of the data.
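The behaviour can be checked with a small, made-up Data Frame (the names `demo` and `cleaned` are only for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'lab_value': [10.1, np.nan, 9.9]})

demo.dropna()              # the result is discarded, demo itself is unchanged
print(len(demo))           # 3

cleaned = demo.dropna()    # keep the returned copy under a new name
print(len(cleaned))        # 2
```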

## Grouping/Aggregation 

Grouping data is a useful technique you may know from Excel, etc. and this can be done using `groupby()` in Pandas.

For example:

In [None]:
patients.groupby('sex')['age'].mean()

What is happening here is the following:

1. `patients.groupby('sex')`
    - Splits the DataFrame into groups based on unique values in column `sex`
    - In our case, we have 'M' and 'F'
2. `['age']`
    - Selects one column (`age`) from each group 
3. `.mean()`
    - Computes the mean of age within each group 

In English: For each sex, compute the average age.

Or combine it with the aggregate function `agg()`:

In [None]:
patients.groupby('sex').agg({
    'age': 'mean',
    'height': ['mean', 'std']
})

So what is happening here:

1. `patients.groupby('sex')`
    - Same grouping as before
2. `.agg({...})`
    - Applies different aggregation functions to different columns
    - This is passed as a dictionary:

```python
{
  'age': 'mean',
  'height': ['mean', 'std']
}
```

- So for `age` compute the mean. 
- For `height` compute the mean and standard deviation. 
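The result of such an aggregation has two-level column names, which can be surprising at first. The following sketch, using a made-up four-patient dataset, shows how the columns are named and how a single value can be read out with `loc`:

```python
import pandas as pd

# Hypothetical data, similar in structure to the patients Data Frame
demo = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'age': [45, 62, 51, 38],
    'height': [167, 175, 165, 190],
})

summary = demo.groupby('sex').agg({
    'age': 'mean',
    'height': ['mean', 'std'],
})

# Each column is identified by a (column, function) pair
print(summary.columns.tolist())
# [('age', 'mean'), ('height', 'mean'), ('height', 'std')]

# Mean age of the female patients: (45 + 51) / 2
print(summary.loc['F', ('age', 'mean')])   # 48.0
```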

## Raw Data

If at any time you need the raw data without the row and column labels, you can use the `values` property:

In [None]:
patients.values

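This is a NumPy array rather than a standard Python list (we will look at NumPy later). If you need nested Python lists that any other Python code can handle, call `tolist()` on it, for example `patients.values.tolist()`.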
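As a quick check, the sketch below inspects what `values` returns for a made-up Data Frame and converts the result into nested Python lists:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'age': [45, 62], 'weight': [85, 70]})  # made-up values

raw = demo.values            # NumPy array, one inner row per Data Frame row
as_lists = raw.tolist()      # plain nested Python lists

print(type(raw))
print(as_lists)              # [[45, 85], [62, 70]]
```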

## Orient

Pandas has a function called `to_dict()`, which converts Data Frames to dictionaries. However, it contains an `orient` parameter which can be very useful for exporting your data, and also for looping through your data. 

Let's again use our `patients` dataset, convert it to a dictionary:

In [None]:
patients.to_dict()

You see that we have an entry for each column. For example, `age` has been exported with the values `45`, `62`, `51`, etc., followed by `sex`, which contains `F`, `M`, `F`, and so on.

By default, `to_dict()` uses `orient="dict"`, meaning each column's values are exported as a dictionary keyed by the index.


Let's change this to `list` and observe what happens:

In [None]:
patients.to_dict(orient='list')

What can be seen is that each column in `patients` has now been exported as a list.

Both `list` and `dict` return column oriented data.

However, if you want row-based data, you can use `records`:

In [None]:
patients.to_dict(orient='records')

Now, each row is returned as a dictionary. 

This makes it ideal for iterating over your patients. 

For example, we can use `for` to go over each `patient` in `patients`:

In [None]:
for patient in patients.to_dict(orient='records'):
    print(f'Age: {patient["age"]} BMI: {patient["bmi"]} Diabetes: {patient["diabetes"]}')

Using `to_dict()` in combination with `orient="records"` and a `for` loop can be a very convenient way to go over your entire dataset row by row.
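Pandas also offers `itertuples()` for the same purpose. The sketch below is an alternative to the `to_dict()` approach, not a replacement for it, and uses a made-up Data Frame; each row arrives as a named tuple, so columns are read as attributes rather than dictionary keys:

```python
import pandas as pd

demo = pd.DataFrame({'age': [45, 62], 'bmi': [30.5, 22.9]})  # made-up values

lines = []
for row in demo.itertuples(index=False):
    # Each row is a named tuple, so columns are available as attributes
    lines.append(f'Age: {row.age} BMI: {row.bmi}')

print(lines)   # ['Age: 45 BMI: 30.5', 'Age: 62 BMI: 22.9']
```

This only works directly for column names that are valid Python names, which is one reason the dictionary-based loop above can be more convenient.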

## Built-in Plotting

Pandas has some built-in plotting functionality, which can be convenient for exploring data quickly. 

The built-in functionality is not particularly powerful, but it can quickly output distributions and so on. The plots would not be considered publication level. For this we will cover dedicated plotting libraries later. 

For now, let's just plot the Data Frame as is, and see what we get out.

In [None]:
patients.hba1c.plot()

It takes some parameters, such as the type of plot, and the $y$-axis label:

In [None]:
patients.hba1c.plot(kind='bar', ylabel='HbA1c (%)')

Two columns can be plotted against each other:

In [None]:
patients.plot(kind='scatter', x='age', y='hba1c')

Or a histogram, in which the values are binned:

In [None]:
patients.plot?

In [None]:
patients["hba1c"].plot(kind="hist", bins=8, xlabel='HbA1c (%)')

Generally speaking, you will use a dedicated plotting library to make plots, but it can be very useful to plot something quickly using Pandas' built-in plotting functionality.



## Exercise 1

Your task is to analyse a tips dataset. 

> The "tips" dataset is a popular dataset often used for demonstration and practice in data analysis and visualisation. It contains information about various attributes of customers in a restaurant, including the total bill amount, tip amount, gender, whether the customer smokes or not, the day of the week, time of day, and the size of the party.
> 
> From: <https://www.kaggle.com/datasets/sakshisatre/tips-dataset>


Below we load the data as a Pandas DataFrame:

In [None]:
import seaborn as sns
tips = sns.load_dataset("tips")

In [None]:
tips

Use Pandas to:

- Show the first 5 rows
- Show the last 5 rows
- Display how many rows and columns the Data Frame has

In [None]:
# Answer here

tips.head(5)


In [None]:
tips.tail()

In [None]:
tips

In [None]:
tips.shape

Next:

- Extract the `total_bill` column and save it to a variable `total_bill_col`
- Select `total_bill` and `tip` together as a DataFrame called `total_bill_and_tip`

In [None]:
# Answer here

total_bill_col = tips['total_bill']

total_bill_and_tip = tips[['total_bill', 'tip']]


In [None]:
total_bill_and_tip

Now:

- Select all rows where `total_bill` is greater than 20.
- Select all rows where `day` is "Sun".
- Select all rows where `time` is "Dinner" **and** `total_bill` > 20.

In [None]:
# Answer here

# Total bill greater than 20
tips[tips['total_bill'] > 20]

# Rows where the day is Sunday
tips[tips['day'] == 'Sun']

# Dinner time and total bill greater than 20
tips[(tips['time'] == 'Dinner') & (tips['total_bill'] > 20)]

Finally:

- Create a column `tip_ratio` defined as `tip` / `total_bill`
- Display the first 5 rows of the updated DataFrame

In [None]:
# Answer here

tips['tip_ratio'] = tips['tip'] / tips['total_bill']

In [None]:
tips.head(5)

---

# NumPy

If you are doing any kind of data science, you will eventually come across the NumPy library. 

NumPy is a library for manipulating arrays, which are like Python lists but far more powerful. NumPy is often used when working with large numeric datasets because it is optimised to perform operations across your data very quickly.

It differs from Pandas in some fundamental ways. In many ways it is lower level than Pandas. It does not provide custom indexes or headers, and NumPy arrays are intended for numeric data, with every element of an array sharing the same type.

However, Pandas and NumPy complement each other, and you will often use a combination of both: for example, you may perform some exploratory work using Pandas and then use NumPy to perform more compute-intensive operations on your data.

Because NumPy is quite low level, we will not spend as much time on it as on Pandas. The advanced course covers NumPy in much more detail if you want to go further.

## Basics of NumPy

### Creating Arrays

Fundamental to NumPy is the array data structure. 1-dimensional arrays are basically the equivalent of Python lists. 2-dimensional arrays can be thought of as tables. 3D arrays are generally used to store images: images often have several colour channels, where each colour channel is one 2D array. In other words, a 2D array for red, a 2D array for green, and a 2D array for blue are stored together as one 3D array.

![NumPy-Matrices](Images/numpy-matrices.png)

However, in this course we will only look at 1D and 2D arrays. In the advanced course we cover 3D arrays, and even 4D arrays for example.

There are various ways in which you can create arrays using NumPy. They can be created from standard Python lists, imported from CSV files, created from Pandas Data Frames, or generated with random numbers. 

Let us first create a 1D array from a standard Python list:

In [None]:
import numpy as np  # Convention is to import as np

# Standard Python list
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Use array() function to create arrays
l_np = np.array(l)

This creates a NumPy array, `l_np`, from the list `l`.

Use `type()` to show this:

In [None]:
type(l)

In [None]:
type(l_np)

Printing a NumPy array also looks slightly different:

In [None]:
l

In [None]:
l_np

### The `arange()` Function

As well as converting from lists, there are a few utility functions for creating arrays, for example `arange` that will create a range of numbers:

In [None]:
np.arange(0, 10)

Again, it is the start index **inclusive** and the stop index **exclusive**!

We can actually omit the start index and just provide a single number:

In [None]:
np.arange(10)

We can also specify a step:

In [None]:
np.arange(0, 10, 2)

To specify a step, you need to provide a start index and stop index.

### The `linspace()` Function

The `linspace()` function is used to create a linearly spaced array, where you specify a start value, a stop value, and the total number of values you would like. By default, both the start and the stop value are included in the result.

Let's demonstrate it here:

In [None]:
np.linspace(0, 1, 5)

This creates 5 values spaced linearly between 0 and 1.
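The sketch below spells out the same call to show that both end points are part of the result, and that the spacing between neighbouring values is `(stop - start) / (number of values - 1)`:

```python
import numpy as np

values = np.linspace(0, 1, 5)
print(values)                 # [0.   0.25 0.5  0.75 1.  ]

# Spacing between neighbours: (1 - 0) / (5 - 1)
print(values[1] - values[0])  # 0.25
```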

Another example:

In [None]:
np.linspace(0, 100, 8)

### Creating 2D Arrays

As we mentioned previously, NumPy can handle 2D, 3D, and in fact $n$-dimensional arrays. Here we will learn how to create and manipulate 2D arrays. In machine learning, 3D arrays are generally used to store image data for use in deep neural networks, for example. As we do not focus on imaging in this course, we will not look at arrays beyond 2D, but the advanced course does cover them.

Just as in creating a 1D array, we can create a 2D array with standard Python lists:

In [None]:
l_2d = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

l_2d

Here we have provided NumPy with a list of lists. Each individual list corresponds to a row in our 2D tabular array.

### The `zeros()` Function

If you need to create an array with 3 rows and 4 columns in which every value is zero, you can use the `zeros()` function:

In [None]:
l_2d = np.zeros((3, 4))

This creates a $3 \times 4$ array:

In [None]:
l_2d

You can check this using the `shape` property, just as we did with Data Frames:

In [None]:
l_2d.shape

### 2D Arrays with `arange()`

Or if you want a table with a particular range of values, you can use `arange()` in combination with `reshape()`.

We saw above that the `arange()` function returns a 1-dimensional array corresponding to the range of values that you provide. Using the `reshape()` function, we can convert this into a tabular 2D array.

We do this as follows:

In [None]:
a = np.arange(49)
a = a.reshape((7, 7))
a

Using function chaining (which we mentioned briefly yesterday), we can also just say:

In [None]:
np.arange(49).reshape((7,7))

Of course, to make a table, the shape of the table you want to create must be valid. 

For example, this will fail:

In [None]:
np.arange(50).reshape((7,7))

You cannot create a 7x7 table with 50 elements.
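NumPy reports this problem by raising a `ValueError`. As a small sketch, the error can be caught and inspected like any other Python exception (the variable name `message` is only for illustration):

```python
import numpy as np

message = None
try:
    np.arange(50).reshape((7, 7))
except ValueError as error:
    # 50 elements do not fit into 7 * 7 = 49 cells, so NumPy refuses
    message = str(error)

print(message)
```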

However, this would work:

In [None]:
np.arange(50).reshape((5, 10))

Or:

In [None]:
np.arange(50).reshape((10, 5))

Sometimes it is easier to work backwards: think of the table shape that you want, e.g. 9x9, and then use `arange()` with 81 elements.
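If you only know one of the two dimensions, `reshape()` can work out the other for you when you pass `-1` in its place. A minimal sketch:

```python
import numpy as np

# 9 rows requested; NumPy infers the 9 columns from the 81 elements
table = np.arange(81).reshape((9, -1))
print(table.shape)   # (9, 9)

# The same idea with 50 elements and 5 rows
wide = np.arange(50).reshape((5, -1))
print(wide.shape)    # (5, 10)
```

The number of elements must still divide evenly by the dimension you provide, otherwise the same error as above is raised.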

### Creating 2D Arrays Using `linspace()` 

Just as with 1D arrays, we can use `linspace()`, but it must also be used in combination with `reshape()`:

In [None]:
np.linspace(0, 1000, 16).reshape((4,4))

Now, the number of linearly spaced items, in this case 16, must be compatible with the shape of the 2D array you want to create ($4 \times 4$).

If we wanted integer values, we could add another method to the end of our chain, namely `astype()`, which allows you to change the type of the values within your array.

We do this as follows:

In [None]:
np.linspace(0, 1000, 16).reshape((4,4)).astype(int)

Here we have passed `int` to the `astype()` function so that we convert our values in the array to whole numbers.
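Be aware that `astype(int)` simply cuts off the decimal part rather than rounding. If you want rounding, one option is to call `np.round()` first, as in this sketch with made-up values:

```python
import numpy as np

values = np.array([1.9, 2.2, 66.7])

truncated = values.astype(int)            # decimals are dropped
rounded = np.round(values).astype(int)    # round first, then convert

print(truncated)   # [ 1  2 66]
print(rounded)     # [ 2  2 67]
```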

### Creating a 2D Array with Random Values

NumPy has its own set of random number generator functions. 

To create a random integer, you merely provide `randint()` with an upper limit:

In [None]:
np.random.randint(10)

This creates a single random integer between 0 and 9. To create a 2D array of random numbers, we provide the `randint()` function with a `size`:

In [None]:
np.random.randint(2, size=(8,8))

### Pandas and NumPy

Pandas and NumPy are interoperable where feasible.

For example, we can create a Pandas Data Frame using a NumPy 2D array as follows:

In [None]:
import pandas as pd
a = np.random.randint(9, size=(3,3))
a_pd = pd.DataFrame(a)
a_pd

You can of course add a header:

In [None]:
pd.DataFrame(a, columns=['Patient 1', 'Patient 2', 'Patient 3'])

And provide an index:

In [None]:
pat = pd.DataFrame(a, columns=['Patient 1', 'Patient 2', 'Patient 3'], index=['Value 1', 'Value 2', 'Value 3'])
pat

So, here we use NumPy and Pandas together:

- NumPy provides the raw numerical data
- Pandas adds labels and tabular structure

Likewise, we can export Pandas Data Frames as NumPy arrays. 

This is possible using `np.array()`, as follows:

In [None]:
pat_np = np.array(pat)
pat_np

However, it is generally recommended to use Pandas' built-in `to_numpy()` function for this, as it will likely do a better job of exporting the data to NumPy format, especially if you have a complicated dataset.

This is done as follows:

In [None]:
pat_np = pat.to_numpy()
pat_np

Here we are using the `to_numpy()` function of the Pandas Data Frame, rather than NumPy's `array()` function.

Note again that NumPy is for numerical computation. If your Pandas Data Frame has non-numerical types, the conversion may not give you a usable numeric array, and many of the functions in NumPy expect purely numerical data and will not work otherwise.
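You can see the effect of mixed types by checking the `dtype` of the converted array. In this sketch, both Data Frames are made up; the one containing a text column ends up as an array of generic Python objects rather than numbers:

```python
import pandas as pd

numeric = pd.DataFrame({'age': [45, 62], 'weight': [85, 70]})   # numbers only
mixed = pd.DataFrame({'age': [45, 62], 'sex': ['F', 'M']})      # numbers and text

print(numeric.to_numpy().dtype)   # an integer type such as int64
print(mixed.to_numpy().dtype)     # object

# Numerical functions work as expected on the purely numeric array
print(numeric.to_numpy().mean())  # 65.5
```

A common approach is therefore to select only the numeric columns before converting to NumPy.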

## Array Properties

There are a few convenient properties that give you information about your arrays.

The `shape` property will tell you the number of rows and columns of your array:

In [None]:
l_np.shape

In [None]:
l_2d.shape

The `ndim` property will tell you the number of dimensions of your array. 

In [None]:
l_2d.ndim

In [None]:
l_np.ndim

2D means it is tabular data. 

The `size` property tells you the total number of elements in the array:

In [None]:
l_np.size

In [None]:
l_2d.size

The `dtype` property will tell you the type of the data contained in your array:

In [None]:
l_2d.dtype

## Array Functions

NumPy arrays have a number of built in functions which can be very useful. 

These include:

- `a.sum()`
- `a.mean()`
- `a.min()`
- `a.max()`
- `a.std()`
- `a.var()`

We will demonstrate some of these now.

So, if you need to sum every element in your array, you can use `sum()`.

In [None]:
# Let's first create a 2D array
a = np.arange(32).reshape((8,4))
a

In [None]:
a.sum()

The largest value in your array can be found using `max()`:

In [None]:
a.max()

Likewise the minimum:

In [None]:
a.min()

The average value can be found using `mean()`:

In [None]:
a.mean()

### Row and Column Operations

Many of these functions can be applied column-wise or row-wise using the `axis` parameter.

For example, `sum()`:

In [None]:
a.sum(axis=0)   # column-wise

In [None]:
a.sum(axis=1)   # row-wise

Notice that `sum()` now returns multiple elements. When `axis` is set to `0`, the `sum()` function is applied column-wise, and therefore 4 values are returned, as our 2D array is an $8 \times 4$ array and we get one summation for each column of the array.

Likewise, when setting `axis` to `1`, we are saying that we wish to apply the function row-wise, and hence 8 values are returned, one summation for each row.
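With a very small array the results can be checked by hand. The following sketch uses a made-up $2 \times 3$ array:

```python
import numpy as np

small = np.array([
    [1, 2, 3],
    [4, 5, 6],
])   # 2 rows, 3 columns

print(small.sum(axis=0))   # [5 7 9]  -> one value per column
print(small.sum(axis=1))   # [ 6 15]  -> one value per row
```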

The same can be done for `mean()`, `min()`, etc.

In [None]:
a.mean(axis=1)  # row wise

In [None]:
a.min(axis=0)  # column wise

### Sorting

You can use `sort()` to sort either each row or each column.

Let's make an unsorted array to demonstrate this:

In [None]:
b = np.random.randint(100, size=(6,6))
b

Now sort the columns in ascending order:

In [None]:
b.sort(axis=0)  # Column-wise
b

And conversely, we can sort the rows in ascending order:

In [None]:
b.sort(axis=1)  # Row-wise
b

Again, we specify whether we want columns or rows with the `axis` parameter.
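Unlike most of the Pandas functions we saw earlier, an array's `sort()` method changes the array itself and does not return a copy. If you want to keep the original order, NumPy's `np.sort()` function returns a sorted copy instead. A minimal sketch with a made-up array:

```python
import numpy as np

values = np.array([3, 1, 2])

sorted_copy = np.sort(values)   # returns a sorted copy
before = values.tolist()        # the original is still [3, 1, 2]

values.sort()                   # sorts the array itself and returns None
after = values.tolist()         # now [1, 2, 3]

print(before, sorted_copy, after)
```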

## Indexing and Slicing

Where NumPy can be very useful is the indexing and slicing of your data. 

As with Python lists, you can select a subset of a 1D array quite easily:

In [None]:
l = np.arange(10)

In [None]:
l

Or a range can be selected:

In [None]:
l[0:3]

Again the start index is included, while the stop index is not.

If you omit one of the values around the colon `:`, it means "from the beginning" or "to the end".

For example:

In [None]:
l[:5]

is the same as saying `l[0:5]`.

Likewise:

In [None]:
l[5:]

We can also specify a step:

In [None]:
l[0:8:2]

Therefore, the indices take the form `start:stop:step`, where `step` is optional.

If you only include a step, you will get every `step`-th element, starting from the first:

In [None]:
l[::3]

Negative indices count from the end of the array towards the beginning:

In [None]:
l

In [None]:
l[-5:-2]

This can be slightly confusing. However, you will often see this used to get the last $n$ items, e.g. to get the last 3 elements:

In [None]:
l[-3:]
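
To make the negative indices less confusing, it can help to translate them into positive ones: an index of `-k` refers to position `n - k`, where `n` is the length of the array. The sketch below illustrates this with a made-up array `v`:

```python
import numpy as np

v = np.arange(10)        # the values 0 to 9
n = len(v)

# A negative index -k refers to position n - k
print(v[-5:-2].tolist())         # [5, 6, 7]
print(v[n - 5:n - 2].tolist())   # the same slice written with positive indices

# A negative step walks through the array backwards
print(v[::-1].tolist())          # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```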

### 2D Array Slicing

2D array slicing is a particular strength of the NumPy library.

Let's first create a 2D array to demonstrate this:

In [None]:
a = np.arange(36).reshape((6,6))
a

We can select individual elements using two indices, one index for the row and one index for the column. 

For example:

In [None]:
a[0, 0]

If we wanted the 3rd element in the 3rd row (this is the value `14` in the array above), we'd say:

In [None]:
a[2, 2]

As you can see, to slice 2D arrays, we separate the indices for the rows and the columns with a `,`.

We first specify the rows, and then the columns, so it takes the form:

```python
array[row_indices, column_indices]
```

### Slicing With Ranges

We can use the colon `:` character to use index ranges, as we saw above.

So, if we want to select only the first row, we can do this:

In [None]:
a[0,:]

What we have done above is to say we want the first row, row 0, and all columns, which can be represented with the colon `:` character.

Let's select only the first column in this case:

In [None]:
a[:,0]

Here we use `:` to say we want all rows, followed by `,` and then we specify we want column `0`, the first column.

We can also use negative indexing, so we might say we want the last two columns:

In [None]:
a[:, -2:]

Or the last row, all columns:

In [None]:
a[-1:, :]

Last, we can also specify a step, as with the 1D arrays above.

So we might say we want every second column:

In [None]:
a[:,::2]

Or every second row:

In [None]:
a[::2, :]

We can also combine these techniques:

In [None]:
a[::2, :-2]

Here we have said we want every second row, for all columns except for the last 2.

Slicing can also be performed on 3D arrays and beyond, however we will not cover this during this course.
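
One thing to be aware of is that a slice of a NumPy array is a view on the original data and not a copy, so changing the slice also changes the original array. The sketch below shows this with a made-up array `grid`, and uses `copy()` to get an independent array:

```python
import numpy as np

grid = np.arange(36).reshape((6, 6))

view = grid[:2, :2]                  # basic slicing returns a view on the same data
view[0, 0] = -1                      # so this also changes grid[0, 0]
print(grid[0, 0])                    # -1

independent = grid[:2, :2].copy()    # copy() creates an independent array
independent[0, 0] = 99
print(grid[0, 0])                    # still -1, the original is untouched
```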

## Array Broadcasting

One particular feature of NumPy that makes it useful for scientific computing is array broadcasting. 

This allows you to write one-line expressions that perform element-wise operations on arrays, without needing to write loops.

Let's demonstrate this with some code. 

Imagine you had two standard Python lists:

In [None]:
a = [10, 20, 30]
weights = [1.1, 2.2, 3.3]

And you wish to multiply the values in `a` with their corresponding values in `weights`.

If we try this with standard Python lists, we will get an error:

In [None]:
a * weights

As we cannot multiply two lists together, we will have to loop over the lists:

In [None]:
weight_sums = []

for i in range(3):
    t = a[i] * weights[i]
    weight_sums.append(t)

weight_sums

However, if we convert those lists to NumPy arrays, we can do this in one line:

In [None]:
a = np.array(a)
weights = np.array(weights)

a * weights

No loops required: NumPy has multiplied each element of `a` with its corresponding element in `weights`.

The same goes for addition. 

Let's take these standard Python data structures:

In [None]:
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

b = [10, 20, 30]

If I wanted to add each row of `a` with `b` I could try using addition:

In [None]:
a + b

While it doesn't fail, it has not done what we wanted: the `+` operator has simply joined the two lists together.

If we do this with NumPy arrays however:

In [None]:
a = np.array(a)
b = np.array(b)

a+b

Again compare this to having to write a loop:

In [None]:
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

b = [10, 20, 30]

new_matrix = []

for row in a:
    new_row = []
    for element, addition in zip(row, b):
        new_row.append(element + addition)
    new_matrix.append(new_row)

new_matrix

Even something as simple as adding a number to every element in a Python list results in an error:

In [None]:
l = [0, 1, 2, 3, 4, 5]

l + 10

However, as you might have guessed, using NumPy this can indeed be done:

In [None]:
l = np.array([0, 1, 2, 3, 4, 5])

l + 10

The same goes for `*` and `/`:

In [None]:
l * 2

In [None]:
l / 2

These operations also apply to 2D arrays:

In [None]:
a = np.arange(9).reshape((3,3))
a

In [None]:
a + 100

In [None]:
a * 2.2

In [None]:
a * a

In [None]:
a

In [None]:
a ** a

Therefore, with array broadcasting, you can perform element-wise operations on your arrays without the need for writing loops over your data. 

Not only this, but array broadcasting is typically much faster than looping over each element in Python, often by one to two orders of magnitude for large arrays.
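
Strictly speaking, broadcasting describes how NumPy combines arrays whose shapes differ, by stretching the smaller array to match the larger one. The sketch below, using made-up arrays `m`, `row`, and `col`, shows how the shape of the smaller array decides whether it is applied to every row or to every column:

```python
import numpy as np

m = np.arange(9).reshape((3, 3))    # shape (3, 3): [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
row = np.array([10, 20, 30])        # shape (3,)
col = row.reshape((3, 1))           # shape (3, 1)

added_to_rows = m + row             # row is added to every row of m
added_to_cols = m + col             # col is added to every column of m

print(added_to_rows.tolist())       # [[10, 21, 32], [13, 24, 35], [16, 27, 38]]
print(added_to_cols.tolist())       # [[10, 11, 12], [23, 24, 25], [36, 37, 38]]
```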

### Searching Arrays

Finally, we will briefly cover searching arrays. 

You can specify conditions in order to search arrays for particular values that meet these conditions. 

Let's create an array and look at a few examples:

In [None]:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a

We can specify our conditions just how we specify indices. If we wanted to find all values greater than 5, we can do so as follows:

In [None]:
a[a > 5]

Note that it returns an array of values, and loses its shape. The array `a` was a 2D array, and we got back a 1D array of values. This is often what you want, but not always, so be aware of it. 
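
If you need to keep the shape, one option is `np.where()`, which keeps the values that meet the condition and replaces all others. A minimal sketch, where the array name `values` and the replacement value `0` are chosen purely for illustration:

```python
import numpy as np

values = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

mask = values > 5                   # boolean array with the same shape as values
kept = np.where(mask, values, 0)    # keep values where mask is True, otherwise use 0

print(int(mask.sum()))              # 4 elements are greater than 5
print(kept.tolist())                # [[0, 0, 0], [0, 0, 6], [7, 8, 9]]
```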

## Exercise 2 

Create slices for each of the colours represented here:

<img src="Images/slicing-task.png" width="300px" />

To be clear:

- The orange coloured slice includes numbers 3 and 4.
- The red slice is the entire column, 2, 8, 14, 20, 26, and 32
- The blue slice is square shaped, and includes 28, 29, 34, and 35

Step 1:

Recreate the array above. It has 36 elements, numbers 0-35, in the form of a $6 \times 6$ 2D array. You can do this using `arange()` and `reshape()`.

Store this in a variable called `a`:

In [None]:
a = np.arange(36).reshape((6,6))
a

Step 2:

Now perform the slices for each of the coloured elements above:

In [None]:
# Your code here



---

# Plotting

We will cover some basics of plotting in Python now. 

The most used library in Python for plotting is Matplotlib. Seaborn is another package that is popular. 

In this seminar we will use Matplotlib, but the concepts will apply also to Seaborn and other plotting libraries.

First let's import the library. By convention, `matplotlib.pyplot` is imported as `plt`, so you will often see the following:

In [None]:
import matplotlib.pyplot as plt

import numpy as np
import random

In [None]:
x = [1, 2, 3, 4]
y = [3, 7, 9, 12]

plt.plot(x, y)

Many adjustments can be made to the look of your plots. The basics include the line style, the marker style, and the colour.

Here we set the colour to green, use triangular markers, and define a dashed line style:

In [None]:
plt.plot(x, y, color='green', marker='^', linestyle='--')

You can also add a label to your plots.

In [None]:
plt.plot(x, y, label='Data 1', color='red', marker='o', linestyle='--')
plt.legend()
plt.show()

Be aware that you need to call `plt.legend()` before calling `plt.show()`. 

This is because you label each line on your plot individually, and `plt.legend()` then collects these labels into a legend.

Speaking of multiple lines, we can just add them one by one before calling the final `plt.show()`, for example:

In [None]:
plt.plot(x, y)
plt.plot(x, [x*x for x in x], label='y = x²')
plt.legend()
plt.show()

You can add a title to your plot using `plt.title()`:

In [None]:
plt.plot(x, y)
plt.plot(x, [x*x for x in x], label='y = x²')
plt.legend()
plt.title("Line Plot Example")
plt.show()

You can change the look and feel of the plots quite easily using styles. 

In [None]:
plt.style.use('ggplot')

This is now set to mimic the style of ggplot, the most commonly used R plotting library, which you will learn about in the next half of the course.

If we now make another plot, we will see the new style:

In [None]:
plt.plot(x, y)
plt.plot(x, [x*x for x in x], label='y = x²')
plt.legend()
plt.title("2 Line Plot Example")
plt.show()

You can find all styles by using the following:

In [None]:
plt.style.available

Some are quite nice, such as `fivethirtyeight`:

In [None]:
plt.style.use('fivethirtyeight')

plt.plot(x, y)
plt.plot(x, [x*x for x in x], label='y = x²')
plt.legend()
plt.show()

The most common plots are supported:

- Line plots
- Scatter plots
- Bar plots
- Histogram plots
- Box plots
- Pie charts
- etc.

In [None]:
plt.scatter(np.random.randn(100), np.random.randn(100))
plt.show()

Bar plots:

In [None]:
psoriasis = ['Pustular', 'Erythrodermic', 'Guttate']
counts = [10, 7, 5]

plt.bar(psoriasis, counts)
plt.title("Psoriasis Counts")
plt.show()

Histogram plots:

In [None]:
import numpy as np

data = np.random.randn(1000)
plt.hist(data, bins=30)
plt.title("Histogram Example")
plt.show()

If you want to export a figure, this can be done using the `savefig()` function:



In [None]:
plt.plot(x, y)
plt.savefig("plot.png")

Matplotlib will infer the file type from the extension. PDF is useful as the plot is saved as vector graphics and can be embedded into publications. PNG might be more useful for embedding in Word documents or sharing plots on the web.
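
As a small sketch of this, the example below saves the same figure once as a PNG and once as a PDF. It uses `plt.subplots()` to get an explicit figure object, which is an alternative to the `plt.plot()` style used above, and it writes to a temporary folder; the folder, file names, and title are made up for illustration:

```python
import os
import tempfile

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [3, 7, 9, 12]

out_dir = tempfile.mkdtemp()                 # temporary output folder for this example

fig, ax = plt.subplots()                     # explicit figure and axes objects
ax.plot(x, y, marker='o')
ax.set_title("Export Example")

png_path = os.path.join(out_dir, "plot.png")
pdf_path = os.path.join(out_dir, "plot.pdf")

fig.savefig(png_path, dpi=300, bbox_inches="tight")   # raster image at 300 DPI
fig.savefig(pdf_path, bbox_inches="tight")            # vector graphics
plt.close(fig)                                        # free the figure once saved
```

The `bbox_inches="tight"` option trims surplus white space around the plot, which is often useful when embedding figures in documents.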

Other options that might be useful are the following:

- `plt.grid(True)`
- `plt.xlim(0, 5)`
- `plt.ylim(0, 10)`

In [None]:
plt.plot(x, y)
plt.grid(False)

In [None]:
plt.plot(x, y)
plt.xlim(-1, 5)
plt.ylim(-1, 13)
plt.show()

## Exercise 3

### Histogram

Create a histogram with 1,000 random numbers. 

To generate the 1000 random numbers you can copy and paste the snippet below:

```python
data = np.random.randn(1000)
```

- Create the histogram with 25 bins.
- Add grid lines and a title.

In [None]:
# Answer here

## Exercise 4

Create a plot with **three** lines. We have already defined `x` and `y` above, so we will first create another list called `z`. The `z` list needs to be the same length as `x` and `y`.

Line 1 should be `x` vs. `y`, line 2 should be `x` vs. `z`, and line 3 should be `z` vs. `y`.

Style each line differently. One with round markers, one with square markers, one with triangular markers.

Give each line a width of 2.

Give each line a label. 

Use the theme `bmh`.

Plot it and save it as a PNG, with a DPI (dots per inch) of 300.

In [None]:
plt.plot?

In [None]:
z = [1.5 ,4.5, 10.9, 15.5]

plt.style.use('bmh')

plt.plot(x, y, linewidth=2, marker='o', label='x vs. y')
plt.plot(x, z, linewidth=2, marker='^', label='x vs. z')
plt.plot(z, y, linewidth=2, marker='s', label='z vs. y')

plt.legend()

plt.savefig('plot.png', dpi=300)

plt.show()

# SQLite

Databases do not work like regular files. You have to connect to them, and once a connection is established, you need to create a "cursor" which defines where in the database you currently are, like a pointer.

We shall do this now to a sample database we have on our computer:

In [None]:
import sqlite3

conn = sqlite3.connect("sample_db.sqlite")
cursor = conn.cursor()

To get data you need to run SQL commands. SQL is a language in its own right. 

A database typically consists of several tables. These tables are often linked via relationships. We will not go into databases in much detail here. We can view the contents of the database under the following URL: <https://sqliteviewer.app/> (open the link and click "**Load a sample**"; this will open the same database we will open here in the notebook).

To get data from the database, you use the `SELECT` command. Here we select everything (using `*`) in the `Invoice` table:

In [None]:
cursor.execute("SELECT * FROM Invoice")
rows = cursor.fetchall()

We can now look to see how many rows were retrieved, and maybe have a look at the first few rows:

In [None]:
len(rows)

In [None]:
rows[:5]

We may want to know what these fields all are, and for this we can use the `PRAGMA table_info()` command:

In [None]:
cursor.execute("PRAGMA table_info(Invoice)")
columns = cursor.fetchall()

In [None]:
columns

We can also run more specific searches using SQL, for example by filtering rows with a `WHERE` clause:

In [None]:
cursor.execute("SELECT BillingCity, Total FROM Invoice WHERE BillingCountry = 'United Kingdom'")
rows = cursor.fetchall()

Let's inspect the rows now:

In [None]:
len(rows)

In [None]:
rows
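
When the value you are searching for comes from a variable, it is safer to let `sqlite3` insert it into the query for you using a `?` placeholder, rather than building the SQL string yourself. The sketch below uses a small in-memory database with made-up rows, so that it runs without the sample database; the names `demo` and `cur` are chosen for illustration:

```python
import sqlite3

# A throwaway in-memory database with a few made-up invoices
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE Invoice (BillingCity TEXT, BillingCountry TEXT, Total REAL)")
cur.executemany(
    "INSERT INTO Invoice VALUES (?, ?, ?)",
    [("London", "United Kingdom", 5.94),
     ("Edinburgh", "United Kingdom", 1.98),
     ("Boston", "USA", 3.96)],
)

country = "United Kingdom"

# The ? is replaced by the value in the tuple
cur.execute(
    "SELECT BillingCity, Total FROM Invoice WHERE BillingCountry = ?",
    (country,),
)
uk_rows = cur.fetchall()
print(uk_rows)

demo.close()
```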

What if I wanted to find the total sales?

To do this, we can use the `SUM()` function:

In [None]:
cursor.execute("""
    SELECT
        SUM(Total) AS total_sales
    FROM Invoice
    WHERE BillingCountry = 'United Kingdom'
""")

total_sales = cursor.fetchone()

In [None]:
total_sales

Imagine you wanted to break the orders down per city? 

This time we will search for orders from the USA, and use `GROUP BY` along with the city:

In [None]:
cursor.execute("""
    SELECT
        BillingCity,
        COUNT(*)   AS total_orders,
        SUM(Total) AS total_sales
    FROM Invoice
    WHERE BillingCountry = 'USA'
    GROUP BY BillingCity
    ORDER BY total_sales DESC
""")

rows = cursor.fetchall()

In [None]:
rows

What if we wanted the total sales, per month, per country?

Well, we can pass two columns to a single `GROUP BY` clause:

In [None]:
stmt = """
SELECT
    strftime('%Y-%m', InvoiceDate) AS year_month,  -- year first, so sorting the text is chronological
    BillingCountry,
    COUNT(*)        AS total_orders,
    SUM(Total)      AS total_sales
FROM Invoice
GROUP BY year_month, BillingCountry
ORDER BY year_month, total_sales DESC;
"""

cursor.execute(stmt)

rows = cursor.fetchall()

In [None]:
rows
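
To see more clearly what grouping by two columns does, the sketch below runs the same kind of query on a tiny in-memory table with made-up invoices. Each distinct combination of month and country becomes one row in the result. The month is formatted year first here, so that sorting the text also sorts the months chronologically:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE Invoice (InvoiceDate TEXT, BillingCountry TEXT, Total REAL)")
cur.executemany("INSERT INTO Invoice VALUES (?, ?, ?)", [
    ("2009-01-01 00:00:00", "Germany", 2.0),
    ("2009-01-15 00:00:00", "Germany", 3.0),
    ("2009-01-20 00:00:00", "Norway", 4.0),
    ("2009-02-01 00:00:00", "Germany", 1.0),
])

cur.execute("""
    SELECT
        strftime('%Y-%m', InvoiceDate) AS year_month,
        BillingCountry,
        COUNT(*)   AS total_orders,
        SUM(Total) AS total_sales
    FROM Invoice
    GROUP BY year_month, BillingCountry
    ORDER BY year_month, total_sales DESC
""")
grouped = cur.fetchall()

for row in grouped:
    print(row)    # one row per (month, country) combination

demo.close()
```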

## Interfacing with Pandas

Pandas can run SQL statements on a database directly.

Using our `conn` earlier, this is done as follows:

In [None]:
df = pd.read_sql_query(
    "SELECT * FROM Invoice",
    conn
)

In [None]:
df.head()

Let's say I want the orders per country and to plot them:

In [None]:
stmt = """
    SELECT
        BillingCountry,
        COUNT(*)   AS total_orders,
        SUM(Total) AS total_sales
    FROM Invoice
    GROUP BY BillingCountry
    ORDER BY total_sales DESC
"""

df = pd.read_sql_query(stmt, conn)

In [None]:
df

In [None]:
df.plot(
    kind="bar",
    x="BillingCountry",
    y="total_sales"
)

### Write to SQL Database

Of course, you can also write to a database, and this can be done directly from within Pandas:

In [None]:
df = pd.read_sql_query(
    "SELECT * FROM Invoice",
    conn
)

df.tail()

So if you remember your Pandas from above, we can use the `loc` indexer to change a value.

Here we are changing the data in the **Data Frame** for now:

In [None]:
df.loc[411, "Total"] = 1000.00

In [None]:
df.tail()

Now that we have changed the data in the Data Frame, `df`, we can write this back to the SQL database.

This will overwrite the existing table if we use the `if_exists="replace"` parameter (the table is dropped and then recreated from the Data Frame):

In [None]:
df.to_sql("Invoice", conn, if_exists="replace", index=False)

Now that this has been written back to the database, we can re-read the data from the database:

In [None]:
df = pd.read_sql_query(stmt, conn)  # stmt is defined above.

And now plot again:

In [None]:
df.plot(
    kind="bar",
    x="BillingCountry",
    y="total_sales"
)

It is good practice to close a connection once you are done with it, and we do so as follows:

In [None]:
conn.close()

You can see if we try to run a command on the database now, it will not work.

In [None]:
cursor.execute("SELECT * FROM Invoice WHERE BillingCountry = 'Ireland'")
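
If you want your code to handle this situation rather than stop with an error, you can catch the exception that `sqlite3` raises. The sketch below uses a throwaway in-memory database, with the names `demo` and `cur` chosen for illustration:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
cur = demo.cursor()
demo.close()                       # the connection is now closed

try:
    cur.execute("SELECT 1")        # any command on a closed database fails
    error_message = None
except sqlite3.ProgrammingError as err:
    error_message = str(err)

print(error_message)
```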

---

# Machine Learning

# Machine Learning and SciKit Learn

In this section we will cover the basics of **Machine Learning**. 

We cover this last, as it builds on the tools covered earlier, such as Pandas and NumPy.

## What is Machine Learning

What exactly is **machine learning**?

To quote Arthur Samuel, an early AI pioneer, it is "the field of study that gives computers the ability to learn without explicitly being programmed". 

In other words, machine learning gives computers the ability to learn algorithms. These are the algorithms a computer programmer would normally have to design for himself or herself, but machine learning has the ability to learn algorithms merely by looking at data. 

So, instead of writing instructions like "if X then Y", you provide the machine learning algorithm with examples (data).

The computer analyses these examples, identifies patterns, and builds a model that can be used to make predictions on new, unseen cases.

### Simple Example

Diagnosing disease from symptoms:

- Input: Age, blood values, lab values.
- Output: Diagnosis.
- Process: Provide many past cases with known diagnoses. The algorithm learns which patterns in the inputs are associated with which diagnoses. This is also known as **training** the algorithm.
- Result: For a new patient, the model predicts the most likely diagnosis.

Important to note: no understanding, reasoning, or intent is involved. Only statistical pattern extraction.

But why would you want to write an algorithm like this? Why not programme it explicitly? The main reason is that the rules are far too complex to write down and actually programme. Or, we do not even know the underlying patterns behind the data, and have yet to discover them. But, we may have lots of data, and wish to see if a pattern in this data does exist. 

### Recent Developments 

Machine learning is both an established field, and is also the technology behind many of the innovations we have seen recently, such as ChatGPT and Large Language Models in general, as well as image generation, image classification, voice and speech understanding and synthesis, and so on.

In medicine, for example, machine learning is used extensively, with recent developments in image classification being perhaps the most prevalent. For example, in dermatology machine learning is used to classify skin lesions and moles, in pathology it is used to analyse histological samples and classify disease, in radiology it is used to segment and isolate tumours in 3D MR imaging, and so on.

![Example Medical Images](./Images/example-med-images.png)

> Woerner, S., Jaques, A. & Baumgartner, C.F. A comprehensive and easy-to-use multi-domain multi-task medical imaging meta-dataset. Nature Scientific Data 12, 666 (2025). https://doi.org/10.1038/s41597-025-04866-4

ChatGPT, image classification, image generation, and so on have mostly been developed through the use of **Deep Learning**, which is based on very large neural networks. We will not cover deep learning in depth in this course, however if you are interested in this topic, the advanced course covers deep learning in some detail, especially the classification of medical images.

In this course, we will study more 'classical' algorithms and apply them to tabular-based datasets. 

## Categories of Machine Learning

There are two main types of machine learning approach, supervised methods and unsupervised methods. The vast majority of machine learning methods are supervised. We will cover both supervised and unsupervised algorithms here, but with more of a focus on supervised methods.

### Supervised Learning

A supervised machine learning problem is one where you train the algorithm with examples that include both the input data **and** the correct outcome. 

Therefore, each training sample/example/patient consists of:

- Input ($X$): measurements, lab values, etc.
- Label ($y$): the known correct diagnosis

The machine learning algorithm learns a mapping from the inputs to the labels, so that you end up with a function that takes the data as its input and provides a prediction:

$$
f(X) \approx y
$$

The procedure is more or less as follows:

- Provide the machine learning algorithm many examples
- The algorithm makes predictions
- You calculate the error by comparing the prediction with the true value
- The parameters of the model are adjusted to reduce the error
- Once trained, you apply this trained function to new, unseen data
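
The procedure above can be sketched in a few lines of NumPy. This is a deliberately simple illustration (fitting a straight line by repeatedly nudging two parameters) and not how SciKit Learn implements its algorithms. The data, the learning rate `lr`, and the number of steps are all made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training examples: the labels follow y = 3x + 2, plus a little noise
X = rng.uniform(0, 10, size=50)
y = 3.0 * X + 2.0 + rng.normal(0, 0.2, size=50)

w, b = 0.0, 0.0      # the model's parameters, starting from a poor guess
lr = 0.01            # how strongly the parameters are adjusted at each step

initial_error = np.mean((w * X + b - y) ** 2)

for step in range(5000):
    pred = w * X + b                     # the algorithm makes predictions
    error = pred - y                     # compare predictions with the true values
    w -= lr * 2 * np.mean(error * X)     # adjust the parameters to reduce the error
    b -= lr * 2 * np.mean(error)

final_error = np.mean((w * X + b - y) ** 2)
print(round(w, 2), round(b, 2))          # should be close to 3 and 2

new_x = 12.0                             # apply the trained function to unseen data
print(w * new_x + b)
```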

Within supervised learning there are also two main subtypes, **classification** and **regression**:

- Classification: the output is a category, such as disease, benign vs. malignant, etc.
- Regression: the output is a number, such as blood pressure, disease progression, tumour size, etc.

### Unsupervised Learning

In unsupervised learning, there is no target variable. In other words, we do not have any labels for our data. 

Unsupervised learning is generally used for exploratory analysis of data. One example is clustering, where data is grouped into clusters based on the similarity of the data.

In this seminar we will cover a clustering algorithm and see how we can apply it to some data.

## Key Terminology

Before we go any further, let's cover some basic terminology and we will also cover some conventions we will use for writing our code.

Let us define the following terms:

- **Features**: these are properties about the sample in question. In a house dataset, this would be square metres, does it have a pool, neighbourhood, etc.
- **Labels** or **targets** or **classes**: this is what you are trying to predict. They are also what you need to supply an algorithm in a supervised setting, during training. In the house price dataset, this is the price of the house.
- **Training data**: what we use to train our algorithm.
- **Test data**: what we use to validate and test our algorithm. These are normally a subset of your total data.

By convention your feature data is stored in `X` (uppercase) and your label data is stored in `y` (lowercase).

Also, if you are splitting your data into training and test sets, you will see the following: `X_train` and `X_test` will contain the features of the training and test sets, and `y_train` and `y_test` will contain the corresponding labels.

We will explain later exactly what the purpose of the training and test set data are.
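
To make the naming convention concrete, the sketch below splits a small made-up dataset by hand using NumPy, keeping 20% of the samples aside as test data. The values and the 20% proportion are assumptions for illustration only; SciKit Learn also provides a `train_test_split()` function for this purpose.

```python
import numpy as np

rng = np.random.default_rng(42)

X = np.arange(20).reshape((10, 2))   # 10 samples with 2 features each (made-up values)
y = np.arange(10) % 2                # 10 made-up labels, one per sample

indices = rng.permutation(len(X))    # shuffle the sample positions
n_test = int(0.2 * len(X))           # 20% of the samples are held back for testing

test_idx = indices[:n_test]
train_idx = indices[n_test:]

# The same positions are used for X and y, so features and labels stay aligned
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)   # (8,) (2,)
```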


## Summary

This table summarises the various types of machine learning approaches, with an application for each:

| Input (X)           | Output (y)               | Application                 | Type                       |
| ------------------- | ------------------------ | --------------------------- | -------------------------- |
| email               | spam (0/1)               | Spam filter                 | Classification, supervised |
| house details       | house price (numerical)  | Real estate price predictor | Regression, supervised     |
| patient data        | disease subtype clusters | Cancer research             | Clustering, unsupervised   |


## Sci-Kit Learn

When doing any scientific computing in Python, you will eventually come across the Sci-Kit Learn library. 

Sci-Kit Learn is a collection of machine learning algorithms, for both supervised and unsupervised learning. 

![Scikit-Estimators](./Images/scikit-estimators.png)

From the help pages you can see that there are dozens of algorithms implemented in the SciKit Learn library. 

<https://scikit-learn.org/stable/supervised_learning.html>

Most algorithms included in SciKit Learn, where possible, use a common API so that once you have learned to use one algorithm in Sci-Kit Learn, you will have learned them all. 
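
As a small sketch of what this common API looks like, the example below trains two different classifiers on the same made-up data. Only the line that creates the model differs; `fit()` and `predict()` are called in exactly the same way for both. The data and the choice of these two algorithms are for illustration only:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: one feature per sample, two clearly separated classes
X = [[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]

predictions = {}

for model in (LogisticRegression(), KNeighborsClassifier(n_neighbors=3)):
    model.fit(X, y)                               # train on the examples
    result = model.predict([[1.5], [9.5]])        # predict for two new samples
    predictions[type(model).__name__] = result.tolist()

print(predictions)
```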

## Machine Learning Workflow

The basic machine learning workflow is as follows:

1. Gather and prepare/clean data
2. Train model
3. Evaluate model

These three phases describe the basic machine learning workflow, and we will cover each of them today. 

## Gather And Prepare Data

A question that is often asked is how much data you need. This is a difficult question to answer; however, it can be thought of in terms of how difficult the problem is to solve. If your dataset concerns predicting house prices based on the area of the house, the number of rooms, and so on, then this might not be a very difficult problem to solve and won't require much data, perhaps 50 or 100 samples. However, if your problem is to predict the effect of some genetic mutation, where the number of features might be in the hundreds, then you might need many thousands of samples before you can train an algorithm. 

The same can be said for image-based problems. If your algorithm needs to differentiate between cars and bicycles, then this would require a lot less data than an algorithm that should differentiate between 100 different car models. 

If you want to use Sci-Kit Learn, it will expect your data to be in a particular format and to follow certain conventions. Luckily, however, once your data is in the format expected by Sci-Kit Learn, almost all of Sci-Kit Learn's algorithms can be used interchangeably, without any modifications to the data itself or how you have prepared it.

So, assuming you have access to some dataset, generally the first thing you will want to do is import and explore the data. 

There are several tools you could use to do this, however SciKit Learn plays nicely with both NumPy and Pandas (which we saw previously) so we will also use these tools to load in our data and explore the data. 

If your data is in CSV, Excel, or another common tabular format, then Pandas offers quick and easy functionality to deal with such data. 

## Skin Disease Dataset

To demonstrate things, we will take a look at a skin disease dataset and use Pandas to read the data, explore the data, clean the data, and prepare it for use in SciKit Learn. This is step 1 of the 3 step procedure mentioned above.

The dataset in question is of erythemato-squamous skin diseases (ESDs), and we will analyse and classify these lesions. To do this, we use a dataset that contains specific observations or features for 366 skin lesions from patients. The features are visual characteristics recorded by a dermatologist, including, for example, "itching", "scaling", and "erythema".

![Psoriasis](./Images/psoriasis-edit.jpg)

> Image credit:
> 
> Dash, M et al. "A cascaded deep convolution neural network based CADx system for psoriasis lesion segmentation and severity assessment". Applied Soft Computing, 91 (2020).

Some background regarding the disease:

There are six different types of erythemato-squamous diseases, such as psoriasis, which exhibit very similar clinical features and for which signs and symptoms may overlap. A further difficulty in differential diagnosis is that a disease in its early stage may display the features of another disease and only reveal its characteristic features at later stages.

As already mentioned, the dataset consists of 366 lesions, each with 12 features and a diagnosis. The diagnosis is one of six possible erythemato-squamous diseases.

Below is an overview of the individual features in the dataset:

| Feature                    | Description                                                     | Values     |
| -------------------------- | --------------------------------------------------------------- | ---------- |
| Erythema                   | Reddening of the skin caused by increased blood flow            | 0, 1, 2, 3 |
| Scaling                    | Scaly or peeling skin surface                                   | 0, 1, 2, 3 |
| Definite Borders           | Well-defined borders around skin lesions                        | 0, 1, 2, 3 |
| Itching                    | Unpleasant sensation that provokes the urge to scratch          | 0, 1, 2, 3 |
| Koebner Phenomenon         | Development of skin lesions at sites of trauma or injury        | 0, 1, 2, 3 |
| Polygonal Papules          | Raised, small, polygonal skin papules                           | 0, 1, 2, 3 |
| Follicular Papules         | Small papules appearing at hair follicles                       | 0, 1, 2, 3 |
| Oral Mucosal Involvement   | Presence of lesions or symptoms in the oral mucosa              | 0, 1, 2, 3 |
| Knee and Elbow Involvement | Presence of lesions or symptoms on the knees or elbows          | 0, 1, 2, 3 |
| Scalp Involvement          | Involvement of the scalp                                        | 0, 1, 2, 3 |
| Family History             | Indicates whether there is a family history of the skin disease | 0, 1       |
| Age                        | Linear variable representing the patient’s age                  | Continuous |

All features have discrete values from 0 to 3, except for family history, which is binary (true or false), and age, which is continuous.

The data were collected through **clinical observation** by a **dermatologist** during patient examination. Most of these features could also be derived from inspection of lesion images (e.g., images acquired with a digital dermatoscope), while others would need to be obtained during patient assessment, such as "itching", as well as "family history" and "age".

In addition, each lesion is associated with a **diagnosis**, which is a categorical variable, namely: **Psoriasis**, **Seborrheic Dermatitis**, **Lichen Planus**, **Pityriasis Rosea**, **Chronic Dermatitis**, or **Pityriasis Rubra Pilaris**.

The dataset is described in detail in the following work:

> Güvenir, H. Altay et al. "Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals." *Artificial Intelligence in Medicine*, 13(3), 1998, 147–165.

### Load the Data

As mentioned, Pandas has a number of functions for reading and importing various data types. 

Our skin lesion dataset is in tab-separated values (TSV) format, similar to CSV but using tabs to delimit columns.

Before loading it with Pandas, we can preview it within Jupyter itself. Several common file formats can be read directly in Jupyter, although in order to analyse the data we must load it in Python.

When looking at a dataset like this, it makes sense to examine the data to check for things like

- Inconsistent naming, such as a column using different spellings of the same term (haemoglobin and hemoglobin), or abbreviations that are used in some places but not in others
- Missing values, often these appear either as blank or they contain a `?` or `NA` or something similar
- Outliers: an age column containing nonsensical numbers (year of birth entered instead of age)

What we can notice in our dermatology dataset is that the `age` column does contain some missing values, which have been replaced with `?` characters.

Bearing this in mind, we will use the `read_csv()` function to read the data and store it in `derma`:

In [None]:
import pandas as pd

derma = pd.read_csv('dermatology-clinical-only.tsv', sep='\t', na_values='?')

Notice that we define the separator as a tab using `sep='\t'`. In the case of a CSV file, this does not need to be defined, as the default is the comma `,` symbol.

Also notice that we have instructed Pandas to treat any occurrence of a `?` as a missing value, using the `na_values` parameter. When reading the file, Pandas will now give `?` characters a special designation as a missing value. We will see shortly how to deal with these missing values.
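
To see the effect of these two parameters in isolation, the sketch below reads a tiny made-up table from a string instead of a file. The column names mirror the dermatology dataset, but the values are invented for illustration:

```python
import io

import pandas as pd

# Two made-up rows, separated by tabs, with a ? in place of a missing age
raw = "itching\tage\tdiagnosis\n2\t45\tpsoriasis\n1\t?\tlichen planus\n"

demo = pd.read_csv(io.StringIO(raw), sep="\t", na_values="?")

print(demo)
print(demo["age"].isna().sum())   # 1 missing value was detected
print(demo["age"].dtype)          # float64, so the column is still numeric
```

Without `na_values="?"`, the `?` would be read as text, and the whole `age` column would be treated as text rather than as numbers.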

There are a large number of formats that can be read by Pandas, as you can see if you type `pd.read_` and press the Tab key to list the functions beginning with `read_`:

In [None]:
pd.read_

In terms of options, there are many ways to configure how you read in your data:

In [None]:
pd.read_csv?

However, you will notice that generally you will only need to configure one or two of these when loading data. 

The most commonly used options here are:

```python
pd.read_csv("data.csv",
            sep=",",                   # The delimiter to use
            header=0,                  # Which row contains your column names, if at all?
            index_col=None,            # Which column to use as an index or id
            na_values=["NA","", "?"])  # How to handle missing data (more on this later)
```

So, now that we have our data stored in `derma`, we can preview it. If the dataset is very large, you probably do not want to print the entire contents of the dataset to the screen, therefore two commonly used functions are `head()` and `tail()`:

In [None]:
derma.head()

By default, the first 5 rows are shown but you can change this by passing a different value:

In [None]:
derma.head(10)

Also, to preview the end of the dataset, the `tail()` function can be used:

In [None]:
derma.tail(3)

Sometimes, there may be too many columns to fit across the screen. In that case, you can either list the columns, as follows:

In [None]:
derma.columns

or you could use the transpose function, which basically pivots a table so that the columns are rows and vice versa:

In [None]:
derma.T

Some other functions you might want to run when you first load a dataset are:

In [None]:
derma.shape

This tells us the number of rows (samples) and the number of columns in the dataset.

The `info()` and `describe()` functions are also useful when first inspecting a dataset:

In [None]:
derma.info()

Here we see that `age` has some missing values for example. 

The `describe()` function can also offer us some insight:

In [None]:
derma.describe()

Something else that you might want to do is check the class distribution of your dataset. For example, in this case, we have a dataset with a number of possible diagnoses. One thing you may want to do is to see how many samples of each class you have in your dataset. This will influence how you might want to analyse the dataset.

We can do this quite easily using the `value_counts()` function, which we can apply to the `diagnosis` column:

In [None]:
derma['diagnosis'].value_counts()

Pandas has some simple built-in plotting capabilities, so we could plot this using the `plot()` function to see the relative differences:

In [None]:
derma['diagnosis'].value_counts().plot(kind='bar')  # Specify bar chart

What we see here is that psoriasis is the most frequent diagnosis, while pityriasis rubra pilaris is very much the minority diagnosis.

One quick way to check whether the distribution might be problematic is to compute the ratio of the largest class to the smallest class. If this value is over 5, you may need to alter your training and evaluation strategy.

Rule of thumb:

- < 2 → balanced
- 2-5 → mild/moderate imbalance
- \> 5 → significant imbalance

Let's compare the ratio of the most frequent and least frequent diagnoses: 

In [None]:
diag_counts = list(derma.diagnosis.value_counts(sort=True))  # Surround with list() to get a standard Python list back

In [None]:
diag_counts

We can use indexing to get the first and last items in the list:

In [None]:
diag_counts[0] / diag_counts[-1]

The ratio is 5.6, which is just over the threshold of 5 in our rule of thumb above. Values lower than 5 are considered to be acceptable, while a ratio below 2 poses no issues at all.
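
The same ratio can be computed directly from the counts, without converting to a list first. The sketch below wraps this in a small helper; the function name `imbalance_ratio` and the toy labels are hypothetical and only serve to illustrate the calculation.

```python
import pandas as pd


def imbalance_ratio(labels: pd.Series) -> float:
    """Ratio of the largest class count to the smallest class count."""
    counts = labels.value_counts()
    return counts.max() / counts.min()


# Hypothetical labels: 8 samples of class 'a' and 2 samples of class 'b'
toy_labels = pd.Series(["a"] * 8 + ["b"] * 2)

ratio = imbalance_ratio(toy_labels)
print(ratio)  # 4.0, which falls in the mild/moderate band of the rule of thumb
```

Using `max()` and `min()` has the small advantage that it does not depend on the order in which `value_counts()` returns the classes.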

Why would this alter how you analyse your data you might ask?

- Training: the model will see many more examples of one class, and very few examples of another class, meaning it will be biased towards diagnosing psoriasis. One possible solution is oversampling.
- Evaluating: if your test set is dominated by one class, you must be very careful when interpreting results. For example, if your test set contains 90% psoriasis samples and your model always predicts psoriasis, then accuracy will look very good, at 90%, but this is not indicative of the actual performance of the model.

The value of 5.6 that we have received does not mean we cannot use this dataset, however. It means we need to be wary of how we approach our analysis.
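
The evaluation pitfall described above can be reproduced in a few lines. This is a minimal sketch on made-up labels, not on the dermatology data; `balanced_accuracy_score` (the average of the per-class recalls) is used here as one possible alternative metric, which is an addition to the material above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical test set: 90 samples of class 0 and 10 samples of class 1
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that ignores its input and always predicts the majority class
y_always_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_majority)
bal_acc = balanced_accuracy_score(y_true, y_always_majority)

print(acc)      # 0.9
print(bal_acc)  # 0.5
```

The accuracy looks impressive, yet the model never identifies a single sample of the minority class, which the balanced accuracy of 0.5 makes visible.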

We will discuss these topics a little bit later.

### Missing Data

Next, we should look to see if we have any missing data in our dataset.

This is important because some algorithms cannot handle missing values, and will fail if data is missing. 

This is such a common task that Pandas has a built-in function for it, called `isna()`:

In [None]:
derma.isna().sum()

According to this, we have 8 missing values in the age column.

If there were multiple columns with missing values, we could sum them with an additional `sum()` call:

In [None]:
derma.isna().sum().sum()

However, in our case all our missing data is in the `age` column.

In percentage terms, we can do the following:

In [None]:
derma.isna().mean() * 100

At just over 2%, our dataset is not being impacted by these missing values by a large amount.

We can quickly find the rows that contain the missing values as follows:

In [None]:
derma[derma.isna().any(axis=1)]  # .any() says to return a row, if any of the values in the row is missing. 
                                 # The .all() function would return only rows where all values are missing.

As you can see, Pandas has replaced the `?` with `NaN`, meaning 'not a number'. This is a special floating-point value that Pandas uses to mark a missing value. It is not the same as `None` or `Null`, which may be legitimate values.
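
`NaN` behaves differently from ordinary values, which is why dedicated functions such as `isna()` exist. The following sketch uses a made-up `ages` series to show the two behaviours that most often surprise people.

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False: NaN never compares equal, not even to itself
print(pd.isna(np.nan))   # True: use isna() to test for missing values

# Hypothetical ages with one missing entry
ages = pd.Series([34.0, np.nan, 51.0])
print(ages.mean())       # 42.5, because NaN is skipped by default
```

Because a comparison such as `derma['age'] == np.nan` would be `False` for every row, missing values should always be located with `isna()`.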

**Note**: when loading our dataset, we knew missing values were being written as `?`. If we want to check for other common strings that are used to represent missing data, we can say the following: 

In [None]:
derma.isin(['', 'NA', 'N/A', 'null', '?']).sum()

If others are found here, we could reload the data specifying `NA` as a missing value also, using the `na_values` parameter, **or** we could replace these values with `NaN`:

```python
import numpy as np  # needed for np.nan

derma.replace(['', 'NA', 'N/A', 'null', '?'], np.nan, inplace=True)
```

### Dealing With Missing Data

As mentioned previously, some algorithms cannot handle missing values and require datasets to be complete before they can be analysed.

What could we do in the case of our dermatology dataset?

We could drop those rows. In fact this is quite commonly done, and we would only lose about 2% of our data if we did so.

To drop them we'd use the `dropna()` function:

In [None]:
derma.dropna()

Notice we have 358 rows. 

Also notice that `dropna()` returns a copy of your dataset; the original is not affected:

In [None]:
derma

So, we still have 366 rows in our `derma` dataset, which is good because we do not want to simply drop the rows. There are better approaches.

For example, for a numerical field such as `age`, we could take the mean or, if appropriate, the median value (for skewed data with outliers), using Pandas' `fillna()` function:

In [None]:
derma.age.fillna(derma.age.median())

We need to do this on a column by column basis. 

**Note**: again, the function has returned a copy of the data (in this case the age column on its own), hence our original data has not been altered. 

To make the change permanent, we can assign the returned value back to the column:

In [None]:
derma["age"] = derma["age"].fillna(derma["age"].median())

In [None]:
derma.info()

**Note**: mean and median work for numerical values. For categorical values, such as the diagnosis field, you will need to use the **mode**, for example. There are also methods for ordered data, such as backwards or forwards fill, and methods for time series data.

We can make a final check of all columns using `isna()`:
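
As a rough illustration of the mode and forward-fill approaches mentioned in the note, the sketch below fills gaps in a small made-up table. The column names `severity` and `reading` are hypothetical and do not exist in the dermatology dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in a categorical and an ordered numeric column
toy = pd.DataFrame({
    "severity": ["mild", np.nan, "mild", "severe"],
    "reading": [1.0, np.nan, 3.0, np.nan],
})

# mode() returns a Series (there can be ties), so take the first entry
most_common = toy["severity"].mode()[0]
toy["severity"] = toy["severity"].fillna(most_common)

# Forward fill: carry the last observed value forward into the gap
toy["reading"] = toy["reading"].ffill()

print(toy)
```

Forward fill only makes sense when the row order is meaningful, for example repeated measurements over time; for unordered rows the mean, median, or mode is usually the safer choice.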

In [None]:
derma.isna().sum()

### Other Data Preprocessing

There are quite a number of things you could do to this data, such as:

- Handle missing values
- Remove or correct outliers
- Normalize/standardize features
- Encode categorical variables
- Create new features (feature engineering) by combining features
- Split data into train/validation/test sets

### Preparing Data for SciKit Learn

Once your data has been cleaned, we will want to prepare it for SciKit Learn.

SciKit Learn expects:

- Training data: 2D array, shape `(n_samples, n_features)`
- Labels: 1D array, shape `(n_samples,)`

Also, it is convention to use `X` for your data and `y` for your labels. A capital `X` is used as it represents a matrix (2D array, tabular data), while `y` is a 1D array (vector), and vectors are conventionally given a lower case letter.

Hence we will save our data to `X`, and our labels to `y` as follows:

In [None]:
X = derma.drop(columns='diagnosis')
y = derma['diagnosis']

In [None]:
y.head()

In [None]:
X.head()

As mentioned above, our data has to have the correct shape, we can check this:

In [None]:
X.shape

In [None]:
y.shape

If we take a look at our `y` labels above, we will see that we have the names of the diagnoses as text. Many algorithms cannot work with classes represented as text, and will only accept numerical classes.

Therefore we will represent each diagnosis with a numerical value. What values we give them does not really matter, so we could do something like this:

```python
derma['diagnosis'] = derma['diagnosis'].replace({
    'psoriasis': 1,
    'seborrheic dermatitis': 2,
    'lichen planus': 3,
    'pityriasis rosea': 4,
    'chronic dermatitis': 5,
    'pityriasis rubra pilaris': 6
})
```

The issue here is that we need to keep track of these values, and to analyse our results later we will need to remember that lichen planus is represented by 3, for example. 

Luckily, Sci-Kit Learn has a label encoder, which encodes our diagnoses numerically and can also be used to keep track of the association/mapping:

In [None]:
y

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(y)

In [None]:
y

If we want to check the mapping, we can do so using the label encoder itself:

In [None]:
encoder.classes_

The encoder stores our mapping from the numerical values to the disease names: the position of each name in `encoder.classes_` is its numerical code.
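
The encoder can also translate numerical codes back into names using `inverse_transform()`, which is handy when presenting results. The sketch below uses a short made-up list of labels and a separate encoder named `enc`, so it does not interfere with the `encoder` fitted above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, not the full dermatology column
labels = ["psoriasis", "lichen planus", "psoriasis", "chronic dermatitis"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

print(codes)                         # integer codes, assigned in sorted order of the names
print(list(enc.classes_))            # position i holds the name for code i
print(enc.inverse_transform(codes))  # back to the original names
```

Note that the codes follow the alphabetical order of the class names, not the order in which the names first appear in the data.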

## Training a Model

The general approach to training a machine learning model is similar for most algorithms.

SciKit-Learn's functionality follows a consistent pattern throughout its built-in machine learning algorithms:

- Initialize the estimator with some set of parameters
- Prepare data using methods such as `train_test_split()`
- Fit to training data (using `fit(X, y)`)
- Predict on new data (using `predict(X)`) - new data generally being your test set.
- Evaluate results with built-in metrics tools, such as `classification_report()` and `confusion_matrix()`

This consistent interface makes it easy to swap different algorithms while maintaining the same workflow structure.
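
To illustrate how little changes when swapping algorithms, the sketch below runs the same steps for two different classifiers. It uses synthetic data from `make_classification` and includes `LogisticRegression` purely as a second example estimator; neither is part of the dermatology analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only to demonstrate the workflow
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

scores = {}
for model in (RandomForestClassifier(random_state=0), LogisticRegression(max_iter=1000)):
    model.fit(Xd_train, yd_train)     # same call for every estimator
    yd_pred = model.predict(Xd_test)  # same call for every estimator
    scores[type(model).__name__] = accuracy_score(yd_test, yd_pred)

print(scores)
```

Only the line that creates the estimator differs between algorithms; the fitting, prediction, and evaluation code stays the same.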

By convention, training and testing splits are saved as:

- `X_train` and `y_train` for your training data and labels
- `X_test` and `y_test` for your test data and labels

Sci-Kit Learn has a built in train/test split method (`train_test_split()`), which we use now to create our train and test data:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Now we have our data randomly split into training data and testing data.

**Note** that `train_test_split()` accepts Pandas Data Frames or NumPy arrays.
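
With imbalanced classes, a purely random split can leave the test set with very few samples of the rarest class. One option, not used in the cell above, is the `stratify` parameter of `train_test_split()`, which keeps the class proportions the same in both splits. The sketch below demonstrates this on made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0 and 20 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(np.bincount(y_tr))  # class counts in the training labels
print(np.bincount(y_te))  # class counts in the test labels
```

Both splits keep the original 80/20 proportion between the two classes.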

We can preview the data to make sure it has done what we expected:

In [None]:
X_train.head()

It is good practice to confirm the sizes of our training and testing data and their labels:

In [None]:
X_train.shape

In [None]:
y_train.shape

Now let's train the random forest algorithm. 

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier()

forest.fit(X_train, y_train)

Once trained, which will only take a few seconds at most, we can get predictions on our **test set**:

In [None]:
y_pred = forest.predict(X_test)

In [None]:
y_pred

We use the test set so that we give the algorithm data that it has never seen. This simulates a new patient for example. 

If you pass an algorithm data it has seen before, it means that it has 'seen' the answers.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

If you are curious about which features contributed the most to the classifications, many algorithms allow you to see these, as they are stored in the `feature_importances_` attribute of the trained model (here `forest.feature_importances_`), as we see below:

In [None]:
importances = pd.Series(
    forest.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

importances.head(10)

The confusion matrix above is the raw output and is not very readable. We can use the raw output above to generate a plot using `ConfusionMatrixDisplay()`:

In [None]:
import matplotlib.pyplot as plt
disp = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred), display_labels=encoder.classes_)
disp.plot(cmap="Blues")
plt.xticks(rotation=45, ha="right")
plt.grid(False)
plt.show()

The confusion matrix is one of the most important visualisations you can perform to assess model performance!
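
When reading a confusion matrix produced by SciKit-Learn, rows correspond to the true classes and columns to the predicted classes. The sketch below uses made-up labels for a three-class problem to show how per-class recall can be read off the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a three-class problem
true_toy = [0, 0, 0, 1, 1, 2, 2, 2, 2]
pred_toy = [0, 0, 1, 1, 1, 2, 2, 0, 2]

cm = confusion_matrix(true_toy, pred_toy)
print(cm)

# The diagonal holds the correct predictions, and each row sums to the
# number of samples of that true class, so their ratio is the recall.
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(recall_per_class)
```

Off-diagonal entries show which classes are being confused with each other: here one sample of class 0 was predicted as class 1, and one sample of class 2 was predicted as class 0.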

### Visualising a Tree

Our Random Forest is a collection of decision trees: this is called an ensemble.

We can take an individual tree from this forest and visualise it:

In [None]:
from sklearn.tree import plot_tree

# select one tree from the forest
tree = forest.estimators_[0]

plt.figure(figsize=(18, 8))
plot_tree(
    tree,
    feature_names=X.columns,
    filled=True,
    rounded=True,
    max_depth=2
)
plt.show()

As you can see, a tree is a collection of true/false rules. These rules have been learned during training, based on the data you have given it.
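
The learned rules can also be printed as plain text using `export_text()` from `sklearn.tree`. The sketch below trains a small stand-alone decision tree on the iris dataset that ships with SciKit-Learn; this dataset is used only as a stand-in so that the example is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# The iris dataset is a stand-in here, not part of the dermatology analysis
iris = load_iris()
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Each indented line is a true/false test on one feature
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each branch ends in a line stating the predicted class, so a prediction can be traced by hand by following the tests from top to bottom.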

## Regression

In this section we will work on a regression problem, to demonstrate how it differs from classification. 

Regression differs from classification in that we are now dealing with a target, $y$, that is a continuous number. This could be house price, heart rate, blood pressure, and so on. They are **not** discrete values like in our classification examples above.

Hence, the way we evaluate the regression models is different. We cannot calculate an accuracy, as we do not have discrete classes, and therefore must use other methods of evaluation. 

The way in which we train and evaluate regression models is also different to classification. 

To train a regression model, we must use algorithms that are specifically designed to perform regression. In the example below we will use Linear Regression, but there are dozens of different algorithms implemented in SciKit Learn: <https://scikit-learn.org/stable/supervised_learning.html>

## House Price Dataset

To demonstrate regression we will use a very simple house price dataset that consists of the area of the house and its sale price. 



In [None]:
house_price = pd.read_csv('house-price.csv')

In [None]:
house_price.head(10)

Here we see the size of the house in square feet, and the price of the house in thousands. 

In [None]:
house_price.plot.scatter(x='sqft', y='price')

As the area of the house increases, so does its price.

We will now use Linear Regression to model this relationship. 

In [None]:
from sklearn.linear_model import LinearRegression

X = house_price.sqft.to_frame()  # to_frame required as Series is not accepted
y = house_price.price

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

Now we have trained our linear regression algorithm, we can test it on unseen data:

In [None]:
y_pred = lin_reg.predict(X_test)

In [None]:
y_pred

These are the predicted values for the houses in the test set. We can visualise this a bit better as follows:

In [None]:
for area, price in zip(X_test.sqft[:10], y_pred[:10]):
    print(f"Area: {area} sq. ft.\t Predicted price:\t{price:.2f}k")

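To evaluate a regression model, you measure the difference between the predicted values and the actual values. Every SciKit-Learn estimator has a `score()` function: for regression algorithms it returns the $R^2$ score, whereas for classifiers it returns the accuracy.

Accuracy itself is not meaningful for regression, because predictions of a continuous value will almost never match the true value exactly, so `accuracy_score()` is not used here.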

To compute the score, we can say:

In [None]:
lin_reg.score(X_test, y_test)

The score returned is known as the $R^2$ score, and it can be interpreted as follows:

- $R^2=1.0$: perfect predictions
- $R^2=0.0$: no better than predicting the mean of $y$ for every sample
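
The $R^2$ score is defined as $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared differences between the true values and the predictions, and $SS_{tot}$ is the sum of squared differences between the true values and their mean. The sketch below computes this by hand on made-up numbers and compares it with `r2_score()` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
actual = np.array([200.0, 250.0, 300.0, 350.0])
predicted = np.array([210.0, 240.0, 310.0, 340.0])

ss_res = np.sum((actual - predicted) ** 2)      # squared error of the model
ss_tot = np.sum((actual - actual.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(actual, predicted))  # agrees with the manual value
```

The formula also shows that $R^2$ can become negative if a model is worse than simply predicting the mean for every sample.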

The `score()` function gives you a nice value that is easy to quickly understand: the closer to 1 the better. However, it doesn't give you the error, which is the difference between the predictions and the true values.

Therefore, another metric which is often used is the **mean absolute error**:

In [None]:
from sklearn.metrics import mean_absolute_error

MAE = mean_absolute_error(y_test, y_pred)

In [None]:
MAE

As you can see, the mean absolute error is not a value between 0 and 1.0; it is expressed in the same units as the target. Here it shows that the predictions of the model are off by 13.8k on average.

How good or bad this error is depends on the data. If this was a car price predictor, 14k error on average would not be very good. For house prices, 14k error is actually very close.

We can visualise this error for each of the data points:

In [None]:
plt.figure()
plt.scatter(X_test["sqft"], y_test, label="Actual")
#plt.scatter(X_test["sqft"], y_pred, label="Predicted")
plt.plot(X_test["sqft"], y_pred, color='red', label='Fitted Line')
plt.xlabel("Square Footage")
plt.ylabel("House Price (k)")
plt.legend()
plt.show()

The mean absolute error measures the average absolute difference between the predicted price and the actual price.

In this toy example, we only had one feature, square footage. In real world datasets we can of course have many dozens of features. Linear Regression is not limited to just one feature; we simply used this small dataset so that we could plot the fitted model.
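
Written out, the calculation is simply the mean of the absolute differences. The sketch below checks this against `mean_absolute_error()` using made-up prices rather than the house price data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices in thousands
actual = np.array([200.0, 250.0, 300.0])
predicted = np.array([190.0, 255.0, 330.0])

# Absolute errors are 10, 5 and 30, so their mean is 15
mae_manual = np.mean(np.abs(actual - predicted))

print(mae_manual)                              # 15.0
print(mean_absolute_error(actual, predicted))  # same value
```

Taking the absolute value matters: without it, over-predictions and under-predictions would cancel each other out.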
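
As a minimal sketch of regression with more than one feature, the example below fits `LinearRegression` to two synthetic features. The data, including the `bedrooms` feature and the pricing rule, is invented for illustration and contains no noise, so the model should recover the rule almost exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical features: area in sq. ft. and number of bedrooms
area = rng.uniform(500, 3000, size=200)
bedrooms = rng.integers(1, 6, size=200)
X_multi = np.column_stack([area, bedrooms])

# Synthetic price (in thousands) built from a known linear rule
y_multi = 50 + 0.1 * area + 10 * bedrooms

multi_reg = LinearRegression().fit(X_multi, y_multi)

print(multi_reg.coef_)       # one coefficient per feature
print(multi_reg.intercept_)  # the constant term
```

The fitted model has one coefficient per column of `X_multi`, which is the only visible difference from the single-feature case.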

## Exercise 5

Perform a regression analysis on a dataset relating to *Abalone*. 

Abalone are large marine gastropod mollusks (sea snails) in the family *Haliotidae*. They live on rocky coastlines, cling to surfaces with a powerful muscular foot, and graze on algae.

The dataset description states:

> Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope - a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability) may be required to solve the problem.

It seems it would be very beneficial if we could predict this age! 

Here is the code to import the data (we will use a package called `ucimlrepo` to fetch the data from the internet):

In [None]:
y

In [None]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
abalone = fetch_ucirepo(id=1) 
  
# data (as pandas dataframes) 
X = abalone.data.features 
y = abalone.data.targets 

# Remove sex as it is the only non-numeric field
# and it makes the analysis much easier later
X = X.drop('Sex', axis=1)

# variable information 
print(abalone.variables)

We can preview the data here:

In [None]:
X.head(5)

And preview the target `y` here:

In [None]:
y.head(5)

Print the number of columns and rows:

In [None]:
print(f"X has {X.shape[0]} rows and {X.shape[1]} columns/features.")
print(f"y has {y.shape[0]} rows and {y.shape[1]} column.")

We now have our data as Pandas Data Frames. Our data is stored in `X` and our labels are stored in `y`.

Now create a train test split of 70% training data and 30% testing data:

In [None]:
# Code here

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Now train a Decision Tree Regressor algorithm, see <https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor>

Use default parameters, except for max depth, which you should change to 3 (the parameter is named `max_depth`, so set `max_depth=3`).

**Important**: Store your decision tree regressor in a variable named `tree`.

The import has been provided for you:

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Code here
tree = DecisionTreeRegressor(max_depth=3)

tree.fit(X_train, y_train)

Now that you have trained the tree, make your predictions and store them in `y_pred` as usual:

In [None]:
# Code here

y_pred = tree.predict(X_test)


Calculate the mean absolute error of your method:

In [None]:
# Code here
from sklearn.metrics import mean_absolute_error

MAE = mean_absolute_error(y_test, y_pred)


In [None]:
MAE

Interpret this result, in written words:


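This is the average absolute difference between the predictions and the true values. The target is the number of rings, which is a proxy for age, so the error can be read as roughly that many years.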


Show the feature importances (code above can be used almost verbatim):

In [None]:
y_test

In [None]:
# Your code here
importances = pd.Series(
    tree.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

importances.head(10)

Interpret this in plain English

Shell weight is the feature the tree relied on most when predicting age.

OR: The largest contribution to the predicted age comes from the weight of the shell.


If you have written all the code as specified, then the following should work:

In [None]:
from sklearn.tree import plot_tree

plt.figure(figsize=(18, 8))
plot_tree(
    tree,
    feature_names=X.columns,
    filled=True,
    rounded=True,
    max_depth=3
)
plt.show()


The tree has learned a set of yes/no rules that can be used to predict the age of an abalone. 


---

# Unsupervised Learning

Unsupervised learning is for data where you have no target variable. 

It is used far less frequently than supervised learning, and is mainly used for exploratory analysis of data. Typically, unsupervised algorithms cluster data into groups, where more similar items tend to cluster together. This is done by trying to find samples which have characteristics in common, and grouping these.

## Clustering

One of the simplest methods is called $k$-means clustering.

## Wisconsin Breast Cancer Dataset

SciKit learn has a number of built-in datasets, including the breast cancer dataset we will analyse now.

> Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.  They describe characteristics of the cell nuclei present in the image. A few of the images can be found at http://www.cs.wisc.edu/~street/images/

It contains the following features:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

Also: 

- Target: Diagnosis (M = malignant, B = benign)
- Class distribution: 357 benign, 212 malignant

We will not be using the diagnosis target during training however! The $k$-means algorithm is an unsupervised learning algorithm and does not use the target variable to train!

First load the dataset:

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data          # features
y = data.target        # true labels (0 = malignant, 1 = benign)

print(X.shape)

We now have our data in the customary `X` and `y` data structures.

Notice that we use the `sklearn.datasets` module. 

This module contains a number of datasets that you can load straight in to SciKit Learn, which have already been prepared in the format that SciKit-Learn expects. 

See <https://scikit-learn.org/stable/datasets.html>

We can preview the data, but we shall see that it is not particularly interpretable:

In [None]:
X[0]

Now we define our clustering algorithm, stating that we want 2 clusters. Note, that in this case we know that there are two groups, but in a real-world clustering scenario, you might not know this and would need to experiment with the number of clusters.
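
One common way to experiment with the number of clusters is to fit $k$-means for several values of $k$ and compare the `inertia_` attribute, which is the sum of squared distances from each sample to its nearest cluster centre. The sketch below does this on synthetic data from `make_blobs`, not on the breast cancer data; the range of $k$ values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around three centres
X_blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X_blobs)
    inertias[k] = km.inertia_  # lower means samples are closer to their centres

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia keeps falling as $k$ grows, so the lowest value is not the goal; instead one looks for the point after which adding more clusters brings little further improvement.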

So, let's define the algorithm and fit it to the data:

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X)

The algorithm will now have assigned each sample to either cluster 0 or cluster 1.

Because we do actually have the true labels, we can map each cluster to the class that is most common among its members, so that one cluster is treated as a prediction of benign and the other as a prediction of malignant.

In [None]:
from scipy.stats import mode

def remap_clusters(true_labels, cluster_labels):
    new_labels = np.zeros_like(cluster_labels)
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        new_labels[mask] = mode(true_labels[mask], keepdims=True).mode[0]
    return new_labels

clusters_aligned = remap_clusters(y, clusters)

We can print our metrics as we did for the classification algorithm earlier:

In [None]:
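from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y, clusters_aligned))

print("\nConfusion matrix:")
print(confusion_matrix(y, clusters_aligned))

# Label 0 is malignant and label 1 is benign, so the names must follow that order
print(classification_report(y, clusters_aligned, target_names=data.target_names))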

And we can plot the clustering by first reducing the dimensionality to 2 using PCA, and then plotting these:

In [None]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, s=20)
plt.title("K-Means Clustering (PCA projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

Now that you have two groups, you can try to fit a line to separate the groups, and use this line as a final classifier for any new data.

Because unsupervised learning is not as common as supervised learning, we have not discussed it in much detail here. However, it is worth knowing that techniques such as clustering exist for exploratory analysis of your data, in the situation where you have no target variable at all.
