# Day 1 - Afternoon

# Introduction to Machine Learning

<img src="./img/house-prices.png" width="50%">

*Image generated using ChatGPT 4*

The topic of this course is Machine Learning, and more specifically practical or applied Machine Learning. We will begin by discussing in general what Machine Learning actually is.

So what is machine learning? One definition states:

> *Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.*

*Arthur Samuel, 1959*

Arthur Samuel wrote a very early program that learned to play the game of Checkers. The algorithm was designed to make random moves, and slowly learn which moves led to an advantage and which led to a disadvantage. It was designed to play against itself, and did so many tens of thousands of times, eventually learning which moves were good and which were bad and should be avoided. In this way it learned to play Checkers to a high level without being explicitly programmed to do so. This is a special type of Machine Learning method called Reinforcement Learning, and it was one of the very first Machine Learning systems. Reinforcement Learning is still used in many fields today, including in AlphaGo, the system that beat the Go champion. The basic premise, however, is that a computer can be made to learn from some data.

More formally, you can think of a machine learning algorithm as a system that takes an input $X$, which is our data, and produces an output $y$, which is our prediction. In other words, it learns a mapping from input $X$ to output $y$.

In **supervised learning**, we give our algorithm examples of input and output pairs: ($X$, $y$) pairs. The algorithm therefore sees input data together with the correct 'answer' for that input data and, given enough of these pairs, should learn the mapping from input $X$ to output $y$. Eventually, when given a new sample $X$, it should give a very good prediction for $y$.

The algorithm learns from these input/label pairs ($X$, $y$) and will eventually be able to predict the output for new input for which you have no label. This is known as **supervised machine learning**, as you provide the algorithm with your data and the correct answers. We will primarily be dealing with supervised machine learning in this course, as it is by far the most used type of machine learning.

## Sub-Categories of Machine Learning

### Supervised vs. Unsupervised Learning

The first thing to note is that machine learning is broken down into a few sub-categories. The first major distinction is between **supervised machine learning** and **unsupervised machine learning**. In supervised machine learning we train on data where we have examples of our data with the corresponding answers (or **labels**, as they are called), as we just discussed. Unsupervised learning deals with the training of algorithms where we have input data, but we **do not have any labels or answers** for this data.

In this course, we are going to be dealing exclusively with supervised learning, as it is by far the most common type of machine learning problem.

### Classification and Regression

Within the sub-category of supervised learning, there are also two major branches: regression and classification. 

Regression deals with predicting some continuous value, for example the price of a house, or a temperature.

Classification deals with predicting a category like spam/not spam in the case of a spam filter, or malignant/benign in some medical application. There are other subcategories of classification, such as binary classification, where there are two possible prediction categories, or multi-class classification, where there can be many different prediction categories. Tomorrow we will train a neural network to predict an image class from a possible 1,000 classes.

The distinction is important for choosing the method you wish to use to train a prediction algorithm. Not all algorithms can do both regression and classification, but many can. For example, neural networks can be trained as regression algorithms or classification algorithms. $k$-Nearest Neighbors ($k$-NN) is usually introduced as a classification algorithm that assigns a label to an input data point based on the majority class of its $k$ nearest neighbors, but it can also be used for regression by averaging the target values of those neighbors. Logistic regression, on the other hand, is a classification algorithm despite its name.
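As a small illustration of an algorithm family that handles both task types, the sketch below fits a decision tree once as a classifier and once as a regressor. The toy arrays are made up purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Made-up toy data: one feature, six samples
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: the target is a category (0 or 1)
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_pred = clf.predict([[2.5], [11.5]])

# Regression: the target is a continuous value
y_reg = np.array([1.5, 2.5, 3.5, 10.5, 11.5, 12.5])
reg = DecisionTreeRegressor(random_state=0).fit(X, y_reg)
value_pred = reg.predict([[2.5], [11.5]])

print(class_pred)  # predicted categories
print(value_pred)  # predicted continuous values
```

The only things that change between the two cases are the type of target and the estimator class; the `fit()` and `predict()` calls are the same.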

## Examples of Machine Learning Applications

Let's think about some possible machine learning applications (remember we are focussing on supervised learning):

| Input ($X$)         | Output ($y$)              | Application         |
|---------------------|---------------------------|---------------------|
| email               | spam (0/1)                | Spam filter         |
| handwritten digits  | label (0, 1, ..., 9)      | Image recognition   |
| English             | German                    | Machine translation |
| house details       | house price (numerical)   | Regression          |

Notes:

- In the case of the email/spam example: this is known as a binary classification problem
- In the case of the handwritten digits example: this is a multi-class classification problem
- In the case of the house details: this is a regression problem, as the output is a real number, the price of the house

In all of these examples, in a supervised learning setting, you will train your algorithm with input $X$ and the correct answers, $y$. For example, for the spam filter, you will want to supply it with examples of emails and whether or not each one is spam. With enough of these examples, the algorithm will begin to learn what types of emails are spam and which are genuine.

Once a machine learning algorithm has been trained for spam filtering, you can then supply it with a new email (one it has never seen before), and it will try to predict the label (spam/not spam) based on this email's contents.

We should also note at this point that applied machine learning is very much an engineering discipline. In other words, given a particular dataset, with a stated goal, there is a best-practice set of decisions that can be made about how to tackle the problem. And there are a lot of decisions that need to be made. Do we select logistic regression or a support vector machine? Do we need more data? Do we need to train longer? Are we *over-fitting* or are we *under-fitting*? All these decisions can be made in a systematic way based on the situation. The goal of this course is to remove as much of the guesswork as possible, and for you to be able to make informed decisions about how to tackle a particular machine learning problem.

# Key Terminology

Before we go any further, let's cover some basic terminology and some conventions we will use when writing our code.

Let us define the following terms:

- **Features**: these are properties of the sample in question. In a house dataset, this would be the size in square metres, whether it has a pool, the neighbourhood, etc.
- **Labels** or **targets** or **classes**: this is what you are trying to predict. They are also what you need to supply an algorithm in a supervised setting, during training. In the house price dataset, this is the price of the house.
- **Training data**: what we use to train our algorithm.
- **Validation data** and **Test data**: what we use to validate and test our algorithm. These are normally a subset of your total data. Generally we 

By convention, your data (the features) is stored in `X` (uppercase) and your label data is stored in `y` (lowercase).

Also, if you are splitting your data into training and test sets, you will see the following: `X_train` and `X_test` contain the features of the training and test sets, while `y_train` and `y_test` contain the corresponding labels.
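A minimal sketch of how these four variables typically come about, using SciKit-Learn's `train_test_split()` function on a built-in dataset. The 20% test size and the `random_state` value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```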

We will explain later exactly what the purpose of the training and test set data are.

# Machine Learning Lifecycle

Before we talk about any algorithms, let's first discuss a typical machine learning lifecycle.

After this we will discuss Sci-Kit Learn, which is Python's most used Machine Learning library. Sci-Kit Learn is a collection of many different algorithms, all with a common API. If you can use one algorithm in SciKit Learn, you can pretty much use them all.

## 1. Problem Definition
- Define your objectives and success criteria
- Frame the problem (classification, regression, clustering; is it a supervised or an unsupervised problem?)

## 2. Data Collection
- Gather your data: how will you get it? Will it be delivered to you? In what form?
- Consider how much data you have, and the quality of the data, as this may determine what algorithms you can use

## 3. Data Exploration & Analysis
- Perform what is known as exploratory data analysis (EDA)
- Visualize the distributions of the data (e.g. the distribution of your target classes, such as malignant/benign)
- Identify outliers and missing values, which you may need to correct in the next phase
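As one small example of such an exploratory step, the sketch below counts how many samples belong to each target class. It uses SciKit-Learn's built-in breast cancer dataset, chosen here only because its classes happen to be malignant/benign.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Count how many samples belong to each target class
classes, counts = np.unique(cancer.target, return_counts=True)
for cls, count in zip(classes, counts):
    print(f"{cancer.target_names[cls]}: {count} samples")
```

A strongly imbalanced class distribution is worth knowing about early, as it affects how you should evaluate a model later.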

## 4. Data Preprocessing
- Handle missing values
- Remove or correct outliers
- Normalize/standardize features
- Encode categorical variables
- Create new features (feature engineering) by combining features
- Split data into train/validation/test sets
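The last step above can be sketched with two calls to `train_test_split()`. The 60/20/20 proportions and the generated dataset are assumptions chosen for illustration, not a fixed rule.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off a 20% test set ...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# ... then split the remainder into training (75%) and validation (25%),
# which gives 60% / 20% / 20% of the original data overall
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions roughly the same in each of the three sets.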

## 5. Model Selection & Training
- Select appropriate algorithms based on problem type (classification, regression, supervised, unsupervised)
- Train your first models, naively
- Perform hyperparameter tuning
- Perhaps train more complex models

## 6. Model Evaluation
- Carefully evaluate models on validation data and test data
- Avoid pitfalls when evaluating your models
- Does your model meet the requirements set out in step 1?

## 7. Model Deployment
- This is the final step
- Serialize and package the model into a single file
- Implement API endpoints or create some user interface to your model (how should people use it?)
- We will discuss this step in much more detail tomorrow
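As a rough sketch of the serialization step, a trained SciKit-Learn model can be written to a single file with `joblib` (which is installed alongside SciKit-Learn) and loaded again later. The model, the dataset, and the file name `model.joblib` are arbitrary choices for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

# Save the trained model to a single file (the file name is arbitrary)
model_path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, model_path)

# Later, or in another program, load it again and predict
restored = joblib.load(model_path)
same_predictions = (restored.predict(X[:5]) == model.predict(X[:5])).all()
print(same_predictions)
```

Note that such files should only be loaded from sources you trust, and ideally with the same library versions that were used to save them.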

We will try to cover as many aspects of this pipeline as possible in this session.

# Sci-Kit Learn

Now we will move on to the Sci-Kit Learn machine learning library.

If you are doing Machine Learning with Python, you will most likely eventually end up using Sci-Kit Learn for many of your machine learning tasks. 

Sci-Kit Learn is a library of machine learning tools, algorithms, and utilities, that covers classification, regression, and clustering (unsupervised learning). 

Because there are many algorithms that are implemented in SciKit Learn, we will cover the general usage of the package, and show a few examples of some of the algorithms. Most algorithms follow the same API conventions, so if you know how to train a few of the algorithms using Sci-Kit Learn, you will probably be able to use all of them without too much difficulty.

If we just take a look at the number of algorithms covered by Sci-Kit Learn, it is overwhelming:

<https://scikit-learn.org/stable/user_guide.html>

Covering all these different methods would be overwhelming, and we would not have enough time to cover even a fraction of them here. Instead I will talk about how to use SciKit Learn **in general**.

Most algorithms use a very similar API. In other words, if you have your data prepared in a way that one of these algorithms will accept it, then you can probably apply quite a few algorithms very easily. 

We mentioned earlier that common convention is that your data is contained in a data structure called `X` and your labels are contained in a data structure called `y`.

## Core Components

**Estimators**:
- All supervised models (classifiers, regressors) are estimators
- The fundamental training API consists of the `fit()` and `predict()` methods
- Convention:
    - `model.fit(X, y)` is used for training your algorithm
    - `model.predict(X)` is then used after training for prediction

**Metrics**:
- Classification: `accuracy_score()`, `precision_recall_fscore_support()`, `confusion_matrix()`, and most often used `classification_report()`
- Regression: `mean_squared_error()`, `r2_score()`

**Preprocessing**:
- `StandardScaler`, `MinMaxScaler`: For feature scaling
- `OneHotEncoder`, `OrdinalEncoder`: For categorical features (`LabelEncoder` is intended for encoding target labels)
- `SimpleImputer` (in `sklearn.impute`): For handling missing values
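As a brief sketch of handling missing values: in current SciKit-Learn versions imputation is provided by `SimpleImputer` in the `sklearn.impute` module. The small array below is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data with one missing value (np.nan) in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Other strategies such as `"median"` and `"most_frequent"` are also available.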

**Model Selection**:
- `train_test_split()`: Splits data into training and testing sets
- `GridSearchCV()`: For hyperparameter tuning
- `cross_val_score()`: For cross-validation evaluation
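A hedged sketch of the last two tools, using a $k$-NN classifier on a built-in dataset. The choice of estimator, the candidate values for `n_neighbors`, and the number of folds are arbitrary assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# 5-fold cross-validation: returns one accuracy score per fold
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean())

# Grid search: try several values of n_neighbors,
# each evaluated with 5-fold cross-validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```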

**Models**: Organized by learning task
- Classification: `LogisticRegression`, `RandomForestClassifier`, `SVC`, etc.
- Regression: `LinearRegression`, `RandomForestRegressor`, etc.
- Clustering: `KMeans`, `DBSCAN`, etc.
- Dimensionality Reduction: `PCA`, `TSNE`, etc.

## Consistent API Pattern

To summarise, we can say that SciKit-Learn's functionality follows a consistent pattern throughout most of its built in algorithms.

Namely, most SciKit-Learn pipelines for a machine learning task follow these steps:

1. Initialize the estimator with some set of parameters
2. Prepare data using methods such as `train_test_split()`
3. Fit to training data (using `fit(X, y)`)
4. Predict on new data (using `predict(X)`) - new data generally being your test set.
5. Evaluate results with built-in metrics tools, such as `classification_report()` and `confusion_matrix()`

This consistent interface makes it easy to swap different algorithms while maintaining the same workflow structure.
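The following sketch walks through these steps for two different classifiers to show that only the estimator object changes. The dataset, the two estimators, and the split size are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Prepare the data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = {}
# Only the estimator object changes; fit, predict and evaluate stay the same
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)        # train on the training set
    y_pred = model.predict(X_test)     # predict on unseen data
    results[type(model).__name__] = accuracy_score(y_test, y_pred)  # evaluate

print(results)
```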

Sci-Kit Learn's API is used so often that other packages mimic it so that they can be used interchangeably with SciKit-Learn's other functions. For example, XGBoost, a well-known gradient boosting library, provides estimators that follow the same `fit()`/`predict()` conventions.

## Core Algorithms 

### Linear Models
- **Linear Regression**: Fits a linear relationship between features and target variables
- **Ridge & Lasso Regression**: Linear regression with L2 and L1 regularization respectively
- **Logistic Regression**: For binary and multi-class classification problems
- **SGD (Stochastic Gradient Descent)**: Implementations of linear models optimized with gradient descent

### Decision Trees
- **DecisionTreeClassifier/Regressor**: Tree-based models for classification and regression
- **Random Forest**: Ensemble of decision trees trained on random subsets of data and features
- **Gradient Boosting**: Sequential ensemble that builds trees to correct errors of previous ones
- **AdaBoost**: Focuses on difficult training examples by adjusting their weights

### Support Vector Machines
- **SVC/SVR**: Support Vector Classification/Regression with various kernels (linear, polynomial, RBF)
- **LinearSVC/LinearSVR**: Faster implementations for linear kernel cases

### Naive Bayes
- **GaussianNB**: For continuous data assuming Gaussian distribution
- **MultinomialNB**: For discrete count data (e.g., text classification)
- **BernoulliNB**: For binary/boolean features

### Nearest Neighbors
- **KNeighborsClassifier/Regressor**: Classification/regression based on k nearest neighbors
- **NearestCentroid**: Classification using the nearest centroid

### Clustering: Unsupervised
- **KMeans**: Partitions data into k clusters by minimizing within-cluster variance
- **DBSCAN**: Density-based clustering that can find arbitrary-shaped clusters
- **Hierarchical Clustering**: Builds nested clusters by merging or splitting

## Getting Data

Before we discuss any algorithms, we will discuss how to get data and how to properly prepare your data into a training set, validation set, and test set. We will also discuss why we would wish to split the data in this way.

## Datasets

Before we can train any algorithms, we will need some data. SciKit Learn's `datasets` module contains a number of sample datasets that can be used to test your code, or to benchmark an approach and so on. 

Benchmarking datasets are useful because they allow you to directly compare the metrics of a new approach with the state of the art.

Often, benchmarking datasets have pre-defined training and test sets, so that you can compare your approach's evaluation metrics with the metrics from the identical test set of previous state-of-the-art approaches. 

So let's now have a look at the datasets API of SciKit-Learn.

Perhaps not surprisingly, the `sklearn.datasets` package contains the relevant functions.

The contents of the `sklearn.datasets` package fall into 3 main categories:

1. Toy datasets
2. Real-world datasets
3. Data generators

Some of these are dataset **loaders**, that is, the data is included in the software package and can be immediately loaded into memory. Others are dataset **fetchers**, that is to say that the first time you load such a dataset, the dataset is downloaded from the internet and saved to a directory on your computer. This directory is checked every time you fetch such a dataset, and if it has already been downloaded in the past, the previously downloaded copy is used. By default this directory is `scikit_learn_data` in your home folder, for example `/home/username/scikit_learn_data` on a Linux system, and will be different under Windows.

Both **fetchers** and **loaders** return a dictionary-like object with *at least* the following attributes:
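If you want to check or change where fetched datasets are stored, SciKit-Learn provides the `get_data_home()` function. The sketch below passes a temporary directory as `data_home`, purely so that the example does not touch your home folder.

```python
import tempfile

from sklearn.datasets import get_data_home

# Called without arguments, get_data_home() returns the default cache directory.
# Here a temporary directory is passed instead, as an illustration of using
# a custom location.
with tempfile.TemporaryDirectory() as tmp_dir:
    cache_dir = get_data_home(data_home=tmp_dir)
    print(cache_dir)
```

The fetcher functions accept the same `data_home` parameter if you want downloads to go somewhere other than the default.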

- A `n_samples` $\times$ `n_features` matrix under the key `data`
- An array of length `n_samples` containing the targets, under the key `target`

Most of the datasets allow you to specify that you only wish to receive the data and targets, without any additional information (most datasets contain information about the class names, and so on). This can be specified using `return_X_y=True` when loading the dataset. Most datasets can also be returned as a Pandas DataFrame by specifying the `as_frame=True` parameter.

On the other hand, the **data generator** functions create datasets according to your requirements. This is very useful to test a method or pipeline you have developed. For example, you can create a dataset which is guaranteed to be linearly separable. If your approach does not work on such a dataset, you can be sure that something in your approach is not functioning.

The **data generator** functions all return a **tuple** of the form `(X, y)`:

- where `X` is a `n_samples` $\times$ `n_features` matrix, and
- where `y` is an array of length `n_samples`

We will look at generating some data later.

## Diabetes Dataset

As an example of a built-in dataset, which is part of SciKit-Learn by default, we will take a look at the diabetes dataset.

First we load the `datasets` module and then use the `load_diabetes()` function to get the data. This data is built-in, and does not need to be fetched from the Internet:

In [None]:
from sklearn import datasets
diabetes = datasets.load_diabetes()

The `diabetes` object that is returned is a dictionary-like object. We can take a look at its keys, to see what kind of data is stored in the object:

In [None]:
diabetes.keys()

We see that we have a `data` attribute, a `target` attribute, and a `DESCR` attribute, and so on.

We can just execute the object directly to see everything it contains:

In [None]:
diabetes

---

We can see the different attributes, such as the `data` and `target` keys/attributes that we spoke about earlier. The data stored in these attributes can be accessed using a key like a dictionary, such as `diabetes['target']`, or using the `.` notation.

Let's have a look at the `DESCR` which should tell us about the dataset itself - note that we can see `DESCR` above, but by wrapping it in a `print()` function it will be displayed in a much cleaner way: 

In [None]:
print(diabetes.DESCR)

Most datasets built in to SciKit-Learn will have this `DESCR` attribute.

We can also take a look at the targets:

In [None]:
diabetes.target

These values are the values you would try to predict in a machine learning algorithm, while `data` contains the features for each patient.

We can preview the data here:

In [None]:
print(diabetes.data)

As we mentioned previously, `target` is an array of length `n_samples`, while `data` is an `n_samples` $\times$ `n_features` matrix:

In [None]:
print(f"Targets shape: {diabetes.target.shape}\nData shape: {diabetes.data.shape}")

The feature names are also available:

In [None]:
diabetes.feature_names

The values `s1`, `s2`, etc. are not very descriptive, however they are described in the `DESCR` attribute, e.g. `s1` is described as the total serum cholesterol.  

Most datasets can also be accessed as Pandas DataFrames, which is often easier to preview within environments like Jupyter.

Therefore, to request a DataFrame we set the `as_frame` parameter to `True`:

In [None]:
diabetes_df = datasets.load_diabetes(as_frame=True)

Now we can preview the `data` attribute (which is now a Pandas DataFrame) using the `head()` or `tail()` functions:

In [None]:
diabetes_df.data.head()

Having a Pandas DataFrame instead of a Numpy 2D array, does make some things a bit easier, such as being able to select columns/features by their column name:

In [None]:
diabetes_df.data.bmi

Pandas includes other helper functions, such as `describe()`, which prints some statistics about each feature in the DataFrame:

In [None]:
diabetes_df.data.describe()

Which you prefer depends on your preferences or use case. If you are working within Jupyter, it can often make sense to use DataFrames, however, if you are writing scripts with zero user interaction, then often Numpy is the better choice. 

If you are not interested in any of the additional information, such as the `DESCR` or anything else, you can specify the `return_X_y=True` parameter, which returns nothing but the raw data and the targets, which you can save into `X` and `y` directly, as shown below:

In [None]:
X, y = datasets.load_diabetes(return_X_y=True)

In [None]:
X.shape

In [None]:
y.shape

This returns only the `X` and `y` data structures and nothing else, by default as Numpy arrays. You can also combine this with `as_frame=True` to get `X` as a DataFrame and `y` as a Series:

In [None]:
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X.head()

## Fetching Data

As mentioned previously, SciKit-Learn also allows you to fetch data from the Internet. One such option is the OpenML repository, and in this section we will take a look at SciKit-Learn's API for interfacing with <https://openml.org>.

If we take a look at the website now, and search for something like 'mice', we will see the types of data available...

The website <https://openml.org> gives you access to over 6,000 datasets, and SciKit-Learn allows you to download them directly, and in a format that SciKit Learn can use immediately.

To do this we use the `fetch_openml()` function. 

First we import it from the `datasets` module, and then download the dataset we want using `fetch_openml()`:

In [None]:
from sklearn.datasets import fetch_openml

mice = fetch_openml(name='miceprotein', version=4)

If we access `mice` directly, we will see it follows the same format as the built-in `diabetes` dataset:

In [None]:
mice

---

Just as we saw with `diabetes` above, we have `data` and `target` structures available:

In [None]:
mice.data.shape

In [None]:
mice.target.shape

In the case of the `MiceProtein` dataset, there is a `details` attribute:

In [None]:
mice.details

To summarise, gathering data can be very tedious, especially if data is provided in obscure or proprietary formats that require converting to a format we can read with Python or SciKit-Learn.

The `datasets` module within SciKit-Learn is a very convenient source of datasets, many of which are used as benchmarking datasets so that you can confirm that your methods work to a sufficient standard. It also allows you to pull data from online resources. 

Last, we will look at generating data. 

# Generating Data

Sometimes, however, you want a dataset with very specific properties to test a machine learning approach that you are trying. Here is where you might use a data generator instead.

Generators are part of the `sklearn.datasets` package, and their names generally begin with `make_`, for example `datasets.make_classification()`.

If we want to create a classification dataset, we can do so as follows:

In [None]:
from sklearn.datasets import make_classification

The `make_classification()` function is quite involved.

We can take a look at the documentation, to get a rough idea of what types of data can be generated using it:

In [None]:
make_classification?

---

Basically, `make_classification` creates a random $n$-class classification problem. 

If we generate a very simple dataset, we can visualise this easily. 

Using the default values, we can run the following:

In [None]:
X, y = make_classification(random_state=0)

print(X.shape)
print(y.shape)

As can be seen, `make_classification()` returns a tuple containing your data in `X` and your targets in `y`. This is exactly how data is returned by many of the `datasets` module's functions.

The data stored in `X` contains 100 samples, each with 20 features.



In [None]:
import pandas as pd
pd.DataFrame(X)

Because this is generated data, the data doesn't contain feature names, and the data looks quite random.

We can plot a few of the features, to get an idea how it looks:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X[:,0], X[:,1], c=y)

Remember from earlier that `X[:,0]` returns all rows from column 0 and that `X[:,1]` returns all rows from column 1. We use `c=y` to define the colours of the classes, and colour the points on the plot.

This means the plot above only plots feature 0 versus feature 1.

We can define exactly how we want to create the data. For example, we can say that we want 2 features in the dataset, and that both are informative. This should then produce a dataset that is trivially separable using any basic algorithm.

Let's create this simpler dataset:

In [None]:
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, n_informative=2, random_state=2, n_clusters_per_class=1)
plt.scatter(X[:,0], X[:,1], c=y);

This looks much better and should be linearly separable.

As you can see, you can create a dataset to match your exact needs, and thereby create data that should be, for example, linearly classifiable. 
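One way to sanity-check this claim is to fit a linear classifier to the generated data and look at its accuracy on that same data. The sketch below uses `LogisticRegression` as an arbitrary choice of linear model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=100, n_features=2, n_redundant=0, n_informative=2,
    random_state=2, n_clusters_per_class=1,
)

# A linear classifier: its accuracy on the data it was trained on indicates
# how well a straight line can separate the two classes
clf = LogisticRegression().fit(X, y)
train_accuracy = clf.score(X, y)

print(train_accuracy)
```

Note that `make_classification()` randomly flips a small fraction of the labels by default (the `flip_y` parameter), so the accuracy is not necessarily exactly 1.0.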

You can of course also make data for regression tasks, unsupervised tasks, include noise, and so on.
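As a short sketch of two further generators, `make_regression()` creates regression data with optional noise and `make_blobs()` creates clusters of points. The parameter values below are arbitrary.

```python
from sklearn.datasets import make_blobs, make_regression

# Regression data: 100 samples, 1 feature, with Gaussian noise added to the targets
X_reg, y_reg = make_regression(n_samples=100, n_features=1, noise=10.0, random_state=0)

# Clustering data: 3 blobs of points in 2 dimensions
# (y_blobs holds the blob index of each point; an unsupervised method would ignore it)
X_blobs, y_blobs = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

print(X_reg.shape, y_reg.shape)
print(X_blobs.shape, y_blobs.shape)
```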

## Summary of Data Gathering

We have seen several ways in which we can gather data. However, getting data is really only half the story. Once data has been gathered, we normally need to prepare it in some way and often data needs to be 'cleaned'. Exactly what this means will be discussed in the next section.


# Data Preprocessing

Let's briefly discuss the topics of data preprocessing.

What is data preprocessing? It is merely getting a dataset ready to be fed into a machine learning algorithm for training. Exactly what you need to do often depends on the type of algorithm you are training. For example, one simple type of data cleaning is handling missing values. Some algorithms cannot handle missing data, while others can. Therefore, this step would depend on the algorithm you want to train.

Other operations include scaling your data, using feature scaling. Sci-Kit Learn provides numerous helper functions for such tasks, as we will see here.
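A minimal sketch of feature scaling with two of Sci-Kit Learn's scalers. The small array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# StandardScaler: each column gets zero mean and unit variance
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```

In a real project the scaler would normally be fitted on the training data only, and then applied to the test data using `transform()`.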

## Cleaning Data

Before we scale or normalise the data, we might want to clean it first. Therefore let's clean a very simple dataset right now to see some methods that you might use.

Note that it is often much easier to use Pandas for such tasks. You can always convert your dataset to Numpy if required later.

Let's work with a dataset known as the Cleveland Heart Disease Dataset, a well known dataset used for benchmarking.

With Jupyter we can view the data directly, which we will do now! 

**Note** that we know from looking at the dataset, that missing values are represented by `?` characters.

Let's load the dataset and then preview it. 

Pandas provides methods for loading CSV files, Excel files, and so on. In this case we are loading a CSV file:

In [None]:
import pandas as pd

heart_disease = pd.read_csv('./data/cleve.mod.mdb.csv', na_values='?')

Now that it has loaded, we can use the `info()` function to have a look at the data:

In [None]:
heart_disease.info()

We have 303 rows/patients, and 14 features. The types of the features are also visible. For example, Age is a numerical field, while Blood Sugar Less 120 is a boolean field.

You may also see that Number of Vessels Coloured has 298 non-null values...

We can also preview the data using `head()` or `tail()`:

In [None]:
heart_disease.head()

In [None]:
print(heart_disease.shape)

### Cleaning

Checking for missing values is one of the **first things** we will do when examining a dataset.

We can do this using the `isnull()` function, and print the missing values per feature/column.

In [None]:
print("\nMissing values per column:")
print(heart_disease.isnull().sum())

We can look at the rows using the `isna()` function:

In [None]:
heart_disease[heart_disease.isna().any(axis=1)]

Missing data is represented by a special value called `NaN` (not a number). We see the rows where Number of Vessels Coloured and Thalassemia have missing values.

Many algorithms cannot handle missing data (although some can), and therefore we have to handle this somehow.

An easy way to do this would be to simply drop those rows. After all, only a few rows have missing data.

However, we can also replace the missing values with, let's say, the mean or median value for that column.

So, for example, a patient with a missing age would get the median age of all the patients.

In Pandas we can do this in one line:

In [None]:
heart_disease = heart_disease.apply(lambda x: x.fillna(x.median()) if x.dtype.kind in 'ifc' else x)

In this case, we are using the `fillna()` function to replace missing values with the column's median, if the column's type is `i` (integer), `f` (float), or `c` (complex). The `apply()` function runs a function on each column of the DataFrame, and we use a lambda function to define this inline.

Now let's see if we have any missing values:

In [None]:
print(heart_disease.isnull().sum())

We can see that Thalassemia still has 2 missing values. We can check these rows again using the `isna()` function:

In [None]:
heart_disease[heart_disease.isna().any(axis=1)]

The reason that these were not replaced is that Thalassemia is not a numerical column, and therefore its missing values cannot be replaced using a mean or median.

We could replace them with the most frequently occurring value, however in this case we will do what is also often done, and we will remove the rows entirely. There are only 2 rows out of over 300 so it should not impact our training.
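For reference, the alternative of filling with the most frequent value could be sketched as follows. The toy DataFrame and its column name `Thal` are hypothetical and not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy column with one missing categorical value
df = pd.DataFrame({"Thal": ["normal", "fixed", "normal", None, "reversible"]})

# mode() returns the most frequent value(s), ignoring missing entries;
# we take the first one
most_frequent = df["Thal"].mode()[0]
df["Thal"] = df["Thal"].fillna(most_frequent)

print(df["Thal"].tolist())
```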

Dropping rows with missing data is so common, there is a function that does exactly this called `dropna()`:

In [None]:
heart_disease =  heart_disease.dropna()

We can check if we have any missing values again:

In [None]:
print(heart_disease.isnull().sum())

In [None]:
heart_disease.info()

Missing values are now all removed.

You may have noticed that we have a Status and Status Type field. Normally by convention the last column is our target, but in this case Status is our target and Status Type is irrelevant, as it does not tell us anything about the patient (it relates to how the data was collected and is therefore useless for the actual classification of the patient).

Therefore we will drop this column altogether. We can use the `drop()` function for this, which allows you to specify what you want to remove, and which axis this is on. Remember axis 0 are our rows, and axis 1 are the columns:

In [None]:
heart_disease.drop('Status Type', axis=1, inplace=True)

By saying `inplace=True`, we modify the `heart_disease` DataFrame directly, otherwise Pandas would return a new DataFrame with the data removed. Many Pandas functions do this, so you need to be careful about this.
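The difference between the two behaviours can be seen in this small sketch, which uses a made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Without inplace=True, drop() returns a new DataFrame and leaves df untouched
smaller = df.drop("c", axis=1)
print(df.shape, smaller.shape)

# With inplace=True, df itself is modified and the call returns None
result = df.drop("b", axis=1, inplace=True)
print(df.shape, result)
```

A common mistake is to write `df = df.drop(..., inplace=True)`, which would overwrite `df` with `None`.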

Therefore, we should now see one less column/feature in the dataset:

In [None]:
heart_disease.shape

In [None]:
heart_disease.info()

We may also have noticed that the **Status** column contains the values 'healthy' and 'sick' and relates to **presence** or **absence** of heart disease. 

Maybe, we decide that the terms 'healthy' and 'sick' are not how we would like to phrase it, so we can replace these values with 'absent' and 'present' instead.

For this we will use the convenient `map()` function:

In [None]:
heart_disease['Status'] = heart_disease['Status'].map({'healthy': 'absent', 'sick': 'present'})

Preview our data once again:

In [None]:
heart_disease.head()

This is just to demonstrate how to replace values quickly. In fact, once we input the values into an algorithm, these strings will be replaced by numerical/boolean values anyway.

You will also notice, that several of the columns in the dataset contain categorical values in the form of strings, such as gender.

We can use `unique()` to see which distinct values a column contains:

In [None]:
print(heart_disease.Gender.unique())

For input in to a training algorithm, we cannot use such categorical values. We need to replace these with numerical values.

We can replace values quite easily, as follows:

In [None]:
heart_disease.Gender = heart_disease.Gender.replace('M', 0)
heart_disease.Gender = heart_disease.Gender.replace('F', 1)

Here we use Pandas' `replace()` function to do this. This is perhaps even easier than using `map()`, from above.

**Note**: if you see a warning when we run this function, we can safely ignore it for now - it is merely informing us about a future change to the functionality of the `replace()` function.
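If you would rather avoid the warning, one option is to go back to `map()` with a dictionary, which handles both categories in a single pass. This is a sketch on a hypothetical series, not a change to the code above:

```python
import pandas as pd

gender = pd.Series(["M", "F", "F", "M"], name="Gender")

# One dictionary covers both categories
gender_encoded = gender.map({"M": 0, "F": 1})

print(gender_encoded.tolist())
```

Bear in mind that `map()` produces `NaN` for any value that is missing from the dictionary, so every category needs an entry.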
 
We can look at our data again:

In [None]:
heart_disease.head()

The gender field only had 'M' and 'F' values, so we could just replace these manually using two lines of code. However, if a field was to contain many different values, you can loop over them and replace them. 

We will do this now to demonstrate how this works.

Again we use the `unique()` function:

In [None]:
cpt = heart_disease['Chest Pain Type'].unique()
print(cpt)

We now have an array containing the unique values for this field. 

We can replace them with numerical values as follows:

In [None]:
for index, value in enumerate(cpt):
    heart_disease['Chest Pain Type'] = heart_disease['Chest Pain Type'].replace(value, index)

And once again preview our data:

In [None]:
heart_disease.head()

However, this method is quite manual and not very clean, and we need to ensure we keep our `cpt` list otherwise we will not be able to trace the numerical values back to the categories. A better approach is to use one of SciKit-Learn's encoders.
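If you do encode a column by hand, the mapping can at least be kept alongside the codes. A minimal sketch using Pandas' `factorize()` function follows; the category labels are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical category labels, used only for illustration
chest_pain = pd.Series(["angina", "asympt", "angina", "notang"])

# codes holds the integers, uniques holds the label belonging to each integer
codes, uniques = pd.factorize(chest_pain)

# Build lookups in both directions so the encoding can be traced back
code_to_label = dict(enumerate(uniques))
label_to_code = {label: code for code, label in code_to_label.items()}

print(codes.tolist())
print(code_to_label)
```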

So instead, for the remaining categorical features, let's use SciKit-Learn's `OrdinalEncoder` to do this.

For demonstration purposes, let's first do this to the 'Resting ECG' column to see how it works.

Let's import `OrdinalEncoder` and then create a new object called `encoder`, telling it to transform the 'Resting ECG' feature:

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder().set_output(transform="pandas")

# Encode the Resting ECG column
resting_ecg_encoded = encoder.fit_transform(heart_disease[['Resting ECG']])

This is now done, and we can take a look at our encoded column:

In [None]:
resting_ecg_encoded.head()

As you can see, the categorical values have been given numerical values for the 'Resting ECG' column.

In this case, we did this just on one column, to demonstrate how it works - but we can apply the encoder to multiple columns at the same time and perform this conversion with just a couple of lines of code. So let's do this now.

First, let's create a new encoder, and then tell it the categorical features that we wish to encode as numerical values:

In [None]:
encoder = OrdinalEncoder().set_output(transform="pandas")

encoded = encoder.fit_transform(heart_disease[['Resting ECG', 'Slope', 'Thalassemia', 'Status']])

That's it, in just a few lines of code we have converted 4 columns in to numerical features!

We can take a look at the columns now:

In [None]:
encoded

Now that we see they look good, we can replace our columns with the encoded columns in one line:

In [None]:
heart_disease[['Resting ECG', 'Slope', 'Thalassemia', 'Status']] = encoded

Let's preview our data once again:

In [None]:
heart_disease.head()

Now all our data is numerical or Boolean (which can be handled as numerical data by Sci-Kit Learn, as they simply represent 0 and 1 anyway).

Note, a very useful feature is that the `encoder` object contains our original categories, which we can reference later if we needed to know what 1 means in the 'Status' field, for example:

In [None]:
encoder.categories_
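As a small sketch of how this lookup works, the snippet below fits an encoder on a hypothetical 'Status' column and then uses `inverse_transform()` to turn the codes back into the original labels:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"Status": ["absent", "present", "present", "absent"]})

encoder = OrdinalEncoder()
encoded = encoder.fit_transform(toy[["Status"]])

# categories_ holds one array per column; the position in the array is the code
print(encoder.categories_[0])

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel().tolist())
```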

### Ordinal vs. Nominal 

We have assumed in each case above that there is an inherent order to the data, hence using an Ordinal encoder. 

A few things to note however: 

- **First**, we would need to check if these fields are indeed ordinal or nominal. For example, the Gender field has been encoded as ordinal, which implies an order in which female, encoded as 1, is somehow 'worth more' than male, encoded as 0. No such order exists for Gender, so it is nominal. An order does exist for the Resting ECG field, as `norm` is *better* than `hyp` which is *better* than `abn`.
- **Second**, we did not actually specify the order for the ordinal encoder, we would normally do so by telling the encoder that `norm` should be `0`, `hyp` should be `1`, and `abn` should be `2`. We didn't do this in order to simplify the code somewhat.
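To illustrate the second point, the order can be passed to the encoder through its `categories` parameter. This is a sketch on a hypothetical column that reuses the `norm`, `hyp`, `abn` labels from above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"Resting ECG": ["hyp", "norm", "abn", "norm"]})

# One list per column, given in the order the codes should follow
encoder = OrdinalEncoder(categories=[["norm", "hyp", "abn"]])
encoded = encoder.fit_transform(toy[["Resting ECG"]])

print(encoded.ravel().tolist())
```

Without the `categories` argument the encoder sorts the labels, which would give `abn` the code 0.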

Therefore, let's fix the Gender field using a nominal encoder. Nominal encoding assumes no ranking of values. Other fields like this might be city, or state, etc.

To do nominal encoding you need to actually create a new feature for each of the possible values in the original field. 

So, instead of Gender being encoded like so:

|   | Gender |
|---|--------|
| 1 | 1      |
| 2 | 0      |
| 3 | 0      |

We would encode it as so:

|   | Gender_Female | Gender_Male |
|---|---------------|-------------|
| 1 | 1             | 0           |
| 2 | 0             | 1           |
| 3 | 0             | 1           |


To do this, we create a nominal encoder which is called a `OneHotEncoder` in machine learning parlance:

In [None]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")

Now perform the encoding, and preview it:

In [None]:
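encoded_gender = encoder.fit_transform(heart_disease[['Gender']])
# Gender was encoded above as 0 (male) and 1 (female), and the encoder orders its output columns by value
encoded_gender.columns = ["Gender Male", "Gender Female"]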

In [None]:
encoded_gender

Now we can insert our new columns. First we drop Gender:

In [None]:
heart_disease.drop('Gender', axis=1, inplace=True)

In [None]:
heart_disease

And add our new columns using the `insert()` function:

In [None]:
# Insert the first column at index 1
heart_disease.insert(loc=1, column='Gender Female', value=encoded_gender['Gender Female'])

# Insert the second column at index 2
heart_disease.insert(loc=2, column='Gender Male', value=encoded_gender['Gender Male'])

In [None]:
heart_disease
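As an aside, Pandas offers a shortcut for the same idea in the form of `get_dummies()`, which drops the original column and adds the indicator columns in one call. A minimal sketch on hypothetical data:

```python
import pandas as pd

toy = pd.DataFrame({"Age": [63, 41, 57], "Gender": ["F", "M", "M"]})

# get_dummies removes the original column and adds one indicator column per value
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)

print(encoded)
```

The advantage of `OneHotEncoder` is that the fitted encoder can be reused later to transform new data in exactly the same way.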

## Scaling and Normalising

Now that we have dealt with our categorical values, we need to scale some of our fields. 

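Scaling means that we alter our numerical data so that all features lie on a comparable scale. Standardisation, for example, shifts and rescales each feature so that it has a mean of 0 and a standard deviation of 1 (it does not change the shape of the distribution). Other methods scale all values between 0 and 1, while others scale between -1 and 1. 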

Why is this done? Quite a few algorithms will behave non-optimally if numerical data is not scaled, because different features might be deemed as more important because they have higher values. For example, in our heart disease dataset, we might incorrectly consider Cholesterol as having a larger impact than Resting Blood Pressure because the values tend to be much higher. If we scale the values between 0 and 1, then each feature looks as important as any other feature. 
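The arithmetic behind the two most common approaches is short enough to write out by hand. The values below are hypothetical readings, used only to show the calculation:

```python
import numpy as np

# Hypothetical cholesterol readings
x = np.array([200.0, 220.0, 240.0, 260.0, 280.0])

# Standardisation: subtract the mean, divide by the standard deviation
z = (x - x.mean()) / x.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1
m = (x - x.min()) / (x.max() - x.min())

print(z)
print(m)
```

After standardisation the values have a mean of 0 and a standard deviation of 1, while min-max scaling places them between 0 and 1.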

SciKit Learn has a large number of scalers built in to its `sklearn.preprocessing` module. 

Here we will use the `StandardScaler` to demonstrate how to do feature scaling.

First let's import `StandardScaler` and create a `scaler` object:

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().set_output(transform="pandas")

Using this `scaler` we will now scale one feature, namely the Cholesterol column, to demonstrate its usage:

In [None]:
cholesterol_scaled = scaler.fit_transform(heart_disease[['Cholesterol']])

Now the data has been scaled, we can look at it side by side with the original data:

In [None]:
pd.concat([cholesterol_scaled, heart_disease.Cholesterol], axis=1)

We can replace our Cholesterol field with our new scaled version as follows: 

In [None]:
heart_disease.Cholesterol = cholesterol_scaled

In [None]:
heart_disease

We can do the same with Age, Resting Blood Pres, Max Heart Rate, and Old Peak.

To shorten the code, we can just apply the `fit_transform()` and replace the original column in one go:

In [None]:
heart_disease['Age'] = scaler.fit_transform(heart_disease[['Age']])
heart_disease['Resting Blood Pres'] = scaler.fit_transform(heart_disease[['Resting Blood Pres']])
heart_disease['Max Heart Rate'] = scaler.fit_transform(heart_disease[['Max Heart Rate']])
heart_disease['Old Peak'] = scaler.fit_transform(heart_disease[['Old Peak']])

And then preview our data again:

In [None]:
heart_disease

### Outliers

We will not deal with this directly, but another cause of issues, even if you have scaled the data, are outliers. 

These are values which lie far away from the typical values of a particular feature. 

Going back to the scaled Cholesterol feature, we can plot the distribution of the data:

In [None]:
cholesterol_scaled.plot.hist();

Seems we might have an outlier or two at around the 6 mark which has skewed things somewhat. Maybe a boxplot would tell us more?

Pandas has a large number of plotting tools, including the ability to generate a boxplot very easily:

In [None]:
cholesterol_scaled.plot.box();

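As a reminder, the green line shows the median value, meaning 50% of the data is above that line and 50% below. The top and bottom of the blue box show the upper and lower quartiles, and the whiskers extend to the most extreme values that are not classed as outliers. The circles show outliers.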

There are several techniques for dealing with outliers, however we will not deal with them in this example. If you want to account for outliers, then take a look at `QuantileTransformer` or `PowerTransformer`, see <https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html>
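If you only want to find the outliers, the rule that a boxplot uses by default can be applied directly: anything further than 1.5 times the interquartile range from the box is flagged. A minimal sketch on made-up values:

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Lower and upper quartiles, and the interquartile range between them
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```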

**Note**, that not all algorithms require scaling. Some, however, absolutely require that all features centre around 0. Check the documentation.  

### Summary

We have seen how we can scale and encode features quite easily. Which type of encoder you use depends on the data and whether it is nominal data, such as gender, or ordinal data, such as Resting ECG, which has an order. 

Once you are used to the SciKit Learn `preprocessing` module, you will see that most scalers, normalisers, and encoders have a very similar API, and that you will be able to transform entire datasets in just a few lines of code. 

### Final Steps

Finally, our dataset is ready and can be converted to NumPy arrays, the conventional format for SciKit Learn.

First we create our label array, which is the Status field, and then drop it from the DataFrame, and also take a look at the data:

In [None]:
import numpy as np

y = np.array(heart_disease["Status"])
heart_disease.drop('Status', inplace=True, axis=1)

In [None]:
y

And now convert our remaining data in to a 2D Numpy array:

In [None]:
X = np.array(heart_disease)

In [None]:
X

The next step normally would be to create our train/validation/test splits, and then train our model. However, we will do this later, after we discuss a concrete machine learning algorithm first.

## Linear Regression

The first concrete, actual algorithm we will look at is called linear regression and is probably the simplest machine learning algorithm you will encounter.

Using this simple example we will take a look at model evaluation, which is how you evaluate how well your model is performing, and how well it will perform on **new data**. This is crucial to understand how well your model might work after it has been trained, and is used on new, unseen data.

So, imagine the following scenario - you have a dataset of house prices and area in square metres.

We will generate this dataset by creating a linear relationship between the size of the house and the price of the house, and adding some noise to the data.

We do so as follows:

In [None]:
import pandas as pd
import numpy as np

# Set the random seed for reproducibility
np.random.seed(0)

# Specify the number of samples
num_samples = 100

# Generate 100 random areas in square meters (from 50 to 250) 
areas = np.random.uniform(50, 250, num_samples).astype(int)

# Generate house prices with a base price, added by a linear function of area (with some noise)
base_price = 100000  # base price for the smallest house
price_per_sqm = 2000  # price per square meter
noise = np.random.normal(0, 30000, num_samples).astype(int) # adding some noise to make it more realistic

# Our prices are therefore the base price, plus the area * price per square meter, plus the noise
prices = base_price + areas * price_per_sqm + noise

# Create a DataFrame
data = pd.DataFrame({
    'Area (sqm)': areas,
    'Price': prices
})

data

We have saved the data as a Pandas DataFrame.

To get a better idea of how the data looks, we can plot it. 

We will use the `scatter()` function from Matplotlib to plot the data:

In [None]:
# Plotting the data
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.scatter(data['Area (sqm)'], data['Price'], alpha=0.5)
plt.title('House Prices vs Area')
plt.xlabel('Area ($m^2$)')
plt.ylabel('Price (€)')
plt.grid(True)
plt.show()

So what do we see?

We see a more or less linear relationship between the area of the house and the price. 

The bigger the house, the bigger the price. 

Note also that there may be many other factors that we are not aware of, such as neighbourhood, or if the house has a pool or not, and so on, which would normally mean we would not see such a linear relationship. 

What we will do now is to train a Linear Regression algorithm, which will try to fit a line to this data as well as possible, and try to capture the relationship between area and price.

Do not worry about the details of the code below for now. Later we will train an algorithm line by line. 

For now we will just train the model on the data we have:

In [None]:
from sklearn.linear_model import LinearRegression

# Reshape data for modeling
X = data['Area (sqm)'].values.reshape(-1, 1)
y = data['Price'].values

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Predict values for the same input
predictions = model.predict(X)

# Calculate the errors
errors = y - predictions

# Plotting the data and the linear fit
plt.figure(figsize=(10, 6))
plt.scatter(data['Area (sqm)'], data['Price'], alpha=0.5, label='Actual Data')
plt.plot(data['Area (sqm)'], predictions, color='red', label='Linear Fit (Linear Regression)')
plt.title('House Prices vs Area')
plt.xlabel('Area ($m^2$)')
plt.ylabel('Price (€)')
plt.legend()
#for i in range(len(X)):
#    plt.vlines(X[i], min(predictions[i], y[i]), max(predictions[i], y[i]), color='gray', alpha=0.5)
plt.grid(True)
plt.show()

Once we have fit the model to the data, you can use it to predict the prices of new houses. 

If you wanted to sell your house today, and it has 150m2 area, we could predict a selling price of about 400,000 based on the line above.
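Rather than reading the value off the plot, we can ask the model directly using `predict()`. The sketch below uses a noise-free version of the synthetic data (same base price and price per square metre) so that the answer is exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free version of the synthetic data, so the fitted line is exact
areas = np.array([50, 100, 200, 250])
prices = 100000 + 2000 * areas

model = LinearRegression()
model.fit(areas.reshape(-1, 1), prices)

# predict() expects a 2D array: one row per house, one column per feature
predicted = model.predict(np.array([[150]]))
print(predicted[0])
```

With the noisy data from above the prediction will differ a little from 400,000, because the fitted line is only an estimate of the true relationship.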

How is the line actually fit, however? 

Well, this is a supervised algorithm, and therefore the error for any predicted line can be calculated by measuring the difference between the prediction, based on the line, and the actual price. 

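This is done by measuring the vertical distance between the line and the true value, and is called a residual.

The line is moved around until the sum of the squared errors (residuals) is minimised. The errors are squared so that positive and negative errors do not cancel each other out.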

We can plot these residuals easily enough:

In [None]:
# Plotting the data and the linear fit
plt.figure(figsize=(10, 6))
plt.scatter(data['Area (sqm)'], data['Price'], alpha=0.5, label='Actual Data')
plt.plot(data['Area (sqm)'], predictions, color='red', label='Linear Fit')
plt.title('House Prices vs Area')
plt.xlabel('Area ($m^2$)')
plt.ylabel('Price (€)')
plt.legend()
for i in range(len(X)):
    plt.vlines(X[i], min(predictions[i], y[i]), max(predictions[i], y[i]), color='gray', alpha=0.5)
plt.grid(True)
plt.show()

So the algorithm can get a numerical value for how well a line fits the data, by summing the squared errors. It then tweaks this line until it minimises this error.  

One such way to get a numerical value for the error is the mean squared error loss:

$$
MSE = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$
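The formula translates almost directly into code. The prices below are made up purely to show the calculation:

```python
import numpy as np

# Hypothetical actual and predicted prices
y_true = np.array([300000.0, 410000.0, 520000.0])
y_pred = np.array([310000.0, 400000.0, 500000.0])

# Residuals: actual minus predicted
residuals = y_true - y_pred

# Mean of the squared residuals
mse = np.mean(residuals ** 2)
print(mse)
```

Because the errors are squared, the MSE is expressed in squared units (here, euros squared), and large errors are penalised much more heavily than small ones.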

We will not get much more in depth about how algorithms work in this course. However, knowing how a basic algorithm works should help us later when we are dealing with other more complex methods.

## Non-Linear Data

So this is fine for linear data, but what about non-linear data? 

As the name LinearRegression suggests, this algorithm tries to find a linear relationship in the data.

This might not always be the case of course, so let's generate a not quite linear dataset for the house prices:

In [None]:
num_samples = 100

# Generate synthetic data
square_metres = np.linspace(50, 300, num_samples).astype(int) # Area in square metres
noise = np.random.normal(0, 20000, square_metres.shape) # Noise to add some variability
noise = np.absolute(noise)
prices = ((square_metres ** 2) * 3 + noise).astype(int) # Non-linear relationship, prices rise quickly as area increases

# Create a DataFrame
data = pd.DataFrame({
    'SquareMetres': square_metres,
    'HousePrice': prices
})

# Show the first few rows of the DataFrame
print(data.head())

# Plot the data to visualize the relationship
plt.figure(figsize=(10, 6))
plt.scatter(data['SquareMetres'], data['HousePrice'], alpha=0.6)
plt.title('House Prices vs. Area')
plt.xlabel('Area ($m^2$)')
plt.ylabel('Price (€)')
plt.grid(True)
plt.show()

Now we can see the relationship is not quite linear, it seems that as area increases, price increases more quickly. 

Of course, we can still fit a straight line to this:

In [None]:
# Reshape the data
X = data['SquareMetres'].values.reshape(-1, 1)
y = data['HousePrice'].values

# Initialize the linear regression model
model = LinearRegression()

# Fit the model
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

#Errors
errors = y - y_pred

# Plot the data and the linear fit
plt.figure(figsize=(10, 6))
plt.scatter(data['SquareMetres'], data['HousePrice'], alpha=0.6, label='Actual Data')
plt.plot(data['SquareMetres'], y_pred, color='red', label='Linear Fit')
plt.title('Linear Fit to Non-linear Data')
plt.xlabel('Area in Square Metres')
plt.ylabel('House Price')
plt.legend()
plt.grid(True)
# Error bars
for i in range(len(X)):
    plt.vlines(X[i], min(y_pred[i], y[i]), max(y_pred[i], y[i]), color='gray', alpha=0.5)
plt.show()

# Model coefficients
intercept, slope = model.intercept_, model.coef_[0]

This does not look bad, however it is difficult to judge visually. What we can do is get a measure for how well the line fits. In this case, we will use the `score()` function which returns the $R^2$ score for the fit.

The LinearRegression model above has been saved as the variable `model` and we can get its score using the `score()` function, which by default is the $R^2$ score. A score of 1 is a perfect fit to the data, a score of 0 is no better than always predicting the mean of $y$, and the score can even be negative for a model that does worse than that.
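For reference, $R^2$ compares the squared error of the model with the squared error of a baseline that always predicts the mean of $y$. A hand computation on made-up numbers shows how the score behaves:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2 = 1 - ss_res / ss_tot

# A model that does worse than predicting the mean gets a negative score
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
r2_bad = 1 - np.sum((y_true - bad_pred) ** 2) / ss_tot

print(r2, r2_bad)
```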

In [None]:
model.score(X, y)

This score looks quite respectable (1.0 would be a perfect fit). Perhaps though we could improve the score with a curve rather than a line, hoping that the curve will fit the data more closely and therefore result in a better score.

So to do this we can now try to fit a polynomial to the data, in other words a non-linear curve:

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Set the degree of the polynomial to 2 for quadratic fitting
degree = 2

# Create a pipeline that first transforms the features into polynomial features, then fits a linear model
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())

# Fit the model
poly_model.fit(X, y)

# Predictions using the polynomial model
y_poly_pred = poly_model.predict(X)

# Errors
errors = y - y_poly_pred

# Plot the data and the polynomial fit
plt.figure(figsize=(10, 6))
plt.scatter(data['SquareMetres'], data['HousePrice'], alpha=0.6, label='Actual Data')
plt.plot(data['SquareMetres'], y_poly_pred, color='red', label='Polynomial Fit')
plt.title('Polynomial Fit to Non-linear Data')
plt.xlabel('Area in Square Metres')
plt.ylabel('House Price')
plt.legend()
plt.grid(True)
plt.show()

This looks at first glance to be better. Of course, the best way is to compare the metrics, once again we calculate the $R^2$ score:

In [None]:
poly_model.score(X, y)

This results in an even higher $R^2$ score than our straight line fit. So the curve fits the curved data better than the line!

So the question you might ask is, how do you select the appropriate type of line? 

For example, we could increase the degree of the polynomial above and it may fit the data even better - the degree of the polynomial increases the complexity of the curve, and could fit the data even better. In fact, eventually you could add enough complexity to a curve so that it would fit the data exactly. But is this what we want to do?

So this leads us to the idea of over-fitting and under-fitting, which we will discuss next. 

---
# Over and Underfitting

- What is over/under fitting
- How do we spot it
- How do we prevent it

## What is over/under fitting? 

**Overfitting** occurs when a model **learns the training data too well**, capturing noise or random fluctuations in the data. All data contains some degree of noise. As a result, an overfitted model performs **well on the training data** but will perform **very poorly when it is applied to new data**. This is a classic sign of an overfit model. Overfitting typically happens when a model is too complex, for example a polynomial with a high degree (we will see examples again later). Complex models have a higher capacity to fit data, but they can also fit to the noise. A Linear Regression model is not complex enough to overfit to the noise of the datasets that we created above, for example.

**Underfitting** occurs when a model is too simple to capture the underlying structure of the data. In this case, the model **performs poorly both on the training data and on any new data**. Underfitting often happens when the model is too basic or lacks the required complexity to represent the underlying relationships in the data.

Let's plot some examples of overfitting and underfitting.

We will use simulated data and train 3 models. One model will not have the complexity to capture the data's structure and will underfit, another model will fit the data too well, including the data's noise, and will overfit, and one model will fit the data about right.

Let's train the models and plot their curves now:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Creating a dataset with some noise
np.random.seed(0)
X = np.linspace(0, 10, 100)
y = np.sin(X) + np.random.normal(scale=0.5, size=X.shape)

# Underfitting model (Polynomial of degree 1)
underfit_model = np.poly1d(np.polyfit(X, y, 1))

# Well-fitted model (Polynomial of degree 3)
well_fit_model = np.poly1d(np.polyfit(X, y, 3))

# Overfitting model (Polynomial of degree 15)
overfit_model = np.poly1d(np.polyfit(X, y, 15))

# Plotting the data and the models
plt.figure(figsize=(12, 6))
plt.scatter(X, y, label='Data')
plt.plot(X, underfit_model(X), label='Underfit Model (Degree 1)', color='green')
plt.plot(X, well_fit_model(X), label='Well-Fit Model (Degree 3)', color='blue')
plt.plot(X, overfit_model(X), label='Overfit Model (Degree 15)', color='red')
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Demonstration of Underfitting, Well-Fitting, and Overfitting')
plt.show()

We can also plot these as 3 separate plots, as it might be easier to see:

In [None]:
# Plotting the underfitting model
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.scatter(X, y, label='Data')
plt.plot(X, underfit_model(X), label='Underfit Model (Degree 1)', color='green') 
#plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Underfitting (Degree 1)')

# Plotting the well-fitting model
plt.subplot(1, 3, 2)
plt.scatter(X, y, label='Data')
plt.plot(X, well_fit_model(X), label='Well-Fit Model (Degree 3)', color='blue') 
#plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Well-Fitting (Degree 3)')

# Plotting the overfitting model
plt.subplot(1, 3, 3)
plt.scatter(X, y, label='Data')
plt.plot(X, overfit_model(X), label='Overfit Model (Degree 15)', color='red') 
#plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Overfitting (Degree 15)')

plt.tight_layout()
plt.show()

## How do we Notice Overfitting and Underfitting?

How do we know if we are overfitting or underfitting?

This is normally done by splitting our data in to training data and test data. 

We experiment with building our models using the training set, and then we test the model on a test set which the model has never seen before. 

Then, we analyse the performance of the model on both the training set, and the test set:

- If the model performs well on the training data, but performs poorly on the test data you have probably overfit the model to the noise of the training set. You need to reduce the model complexity (polynomial of smaller degree in the example above).
- If the model performs poorly on the training set and test set, then you may be underfitting, in which case you need to increase the model complexity.
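The comparison described in these two points can be sketched with the simulated sine data from above. Note that the way the test set is held out here (every fifth point) and the polynomial degrees are arbitrary choices made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100)
y = np.sin(X) + rng.normal(scale=0.5, size=X.shape)

# Hold out every fifth point as a simple, hypothetical test set
test_mask = np.arange(len(X)) % 5 == 0
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]

def mse(model, X, y):
    """Mean squared error of a fitted polynomial on the given data."""
    return np.mean((y - model(X)) ** 2)

results = {}
for degree in (1, 3, 9):
    # Fit on the training data only, then measure the error on both sets
    model = np.poly1d(np.polyfit(X_train, y_train, degree))
    results[degree] = (mse(model, X_train, y_train), mse(model, X_test, y_test))
    print(f"Degree {degree}: train MSE={results[degree][0]:.3f}, test MSE={results[degree][1]:.3f}")
```

The training error can only go down as the degree increases. What matters is whether the test error follows it down, or starts to drift away from it.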

Splitting the data into a training set and a test set **simulates** the scenario of having new, unseen data. The test set data is never seen by the algorithm during training. We will use the training set to train our model, and once this is trained, we can test the model using the test set, which the model has never seen before. 

Looking at the metrics of the models on the training set and the test set will help us decide if we are over-fitting or under-fitting.

Later we will see examples of using training sets and test sets.

### Creating a Train and Test Split

SciKit-Learn's `model_selection` module contains functions, such as `train_test_split()` which make it very easy to create randomised train/test splits, as well as a number of other useful tools for creating training and testing datasets. 

Our simulated data from above is stored in two structures, `X` and `y` (which is the convention for naming data), where `X` contains the data and `y` contains the labels.

To create a training set and test set we use `train_test_split()`:

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True, test_size=0.2)

This will shuffle your data (and shuffle `X` and `y` in unison, which is obviously very important) and create your training and testing splits. The `test_size` parameter defines the proportion of the data that is held out for testing, so `test_size=0.2` gives an 80/20 split.
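Because the shuffle is random, the split changes every time the cell is run. If you need the same split each time, you can fix the `random_state` parameter, as in this sketch on made-up data (the value 42 is arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# The same random_state gives the same shuffle, and therefore the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
print(np.array_equal(X_test, X_test2))
```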

We will see a live example of `train_test_split()` being used later.

We will also discuss further training/test split strategies at the end of this seminar, including a method known as cross validation.

We can confirm the split as follows:

In [None]:
print(f"""Original dataset (X) size: {len(X)}
Train data (X_train) size: {len(X_train)}
Training labels (y_train) size: {len(y_train)}
Test data (X_test) size: {len(X_test)}
Test labels (y_test) size: {len(y_test)}""")

We can also confirm it has been shuffled. 

First look at the original data:

In [None]:
X

And now the shuffled data:

In [None]:
X_train

Once you have these data structures, we will actually train our algorithms using the training data in `X_train` and test the trained models using `X_test`. The model has never seen the data in `X_test` and it should show us if we are overfitting or underfitting to the data.

We will discuss how to evaluate models in much more detail tomorrow.

Now we will discuss an important aspect of train/test splits, and a technique called Cross Validation. 

# *k*-Fold Cross Validation

One issue with using a single train/test split, is that when you split the data, you may get an 'easy' **test set** at random. 

Therefore, your model performs well on this training set, and also very well on the test set, but then in a real-world setting the model suddenly underperforms.

Equally, you might get a very difficult test set. How might a test set be difficult? Let's say you have outliers in the dataset, a 'hard' test set might end up with all the outliers in the test set. This means the model never saw any outliers during training, and now suddenly encounters them during testing.

Therefore, your model might perform very poorly on this test set, and you might think that perhaps your model is simply not very good, when in fact it may well be down to this train/test split.

What is the solution to this? 

We could repeat the experiment several times, with different training / test splits.

If the model performs well across a few different training and testing splits, you can argue that the model is more robust. 

How might you do this? 

We could manually create *n* training and test splits, let's say *n*=3, and run the experiment 3 times. Again, if the performance of the model on the test sets is stable, it implies that your training strategy is robust and perhaps your final trained model would be generalisable to real world data.

This is fine, but the best way is to systematically split the data in to *k* folds, where each fold is used once as the test set while the remaining folds are used for training. For example, 80/20 splits imply *k*=5. Commonly you will see 10-fold cross validation, where you will have 90/10 splits, or 5-fold cross validation, where you will have 80/20 splits.

![CV](./img/cross-validation.png)

*Source*: <https://www.mltut.com/k-fold-cross-validation-in-machine-learning-how-does-k-fold-work/>

This is known as *k*-Fold Cross Validation.

Note the following:

- Every sample is eventually tested against
- Similarly, you use all your data for training. In a scenario with limited data, this is an advantage.
- In a *k*-fold you will train *k* models and get *k* metrics
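To make these points concrete, the sketch below uses SciKit-Learn's `KFold` class on 10 made-up samples and prints which samples land in the training and test set of each fold:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

tested = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    tested.extend(test_idx.tolist())

# Across the 5 folds, every sample index appears in a test set exactly once
print(sorted(tested))
```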

Generally, after performing the experiment *k* times, you would report the results of all *k* models, noting if there is large variance between the *k* different models.

What are some of the disadvantages of cross validation?

- Time consuming, as you need to train *k* models
- Not possible for very large datasets: for example, for image analysis with very deep networks, training can take 1 week. 10 fold cross validation is not feasible in this scenario

## 5-Fold Cross Validation Example

Here we will perform 5-fold cross validation on a small dataset known as the Iris dataset.

To do this we will use Sci-Kit Learn's `cross_val_score()` function, which handles all of the train/test splitting, the training, and simply returns the scores of all 5 folds:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn import svm
from sklearn.datasets import load_iris

model = svm.SVC(kernel='linear', C=1, random_state=42)

X, y = load_iris(return_X_y=True)

scores = cross_val_score(model, X, y, cv=5)

for fold, score in enumerate(scores):
    print(f"Fold: {fold+1}: score: {score}")

What you see here are the scores for 5 different models, each trained on a different train/test split.

What you should look out for is that each of the models performs more or less the same, meaning your training strategy is not sensitive to the test data that it gets, meaning it should be generalisable when confronted with real world data.
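A common way to summarise the folds is to report the mean score together with the standard deviation across folds. A self-contained sketch that repeats the setup from above:

```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = svm.SVC(kernel="linear", C=1, random_state=42)

scores = cross_val_score(model, X, y, cv=5)

# The mean summarises overall performance, the standard deviation shows the spread across folds
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A small standard deviation relative to the mean suggests that the result does not depend heavily on which samples ended up in the test set.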

Now we will discuss how to choose the machine learning algorithm.

---
# Choosing a Machine Learning Algorithm

Now that we have no issue getting and generating data, and we have an overview about how we might evaluate a model, let's see how you actually choose the algorithm for your particular task.

Sci-Kit Learn is a comprehensive package, and contains many dozens of algorithms for training models. 

As we said previously, we mainly discuss supervised learning in this course. However, the API used for training models in Sci-Kit Learn is basically the same for most of the algorithms.

Because there are so many algorithms, you might wonder: how do I even choose an algorithm for my particular task? The following flowchart provides an overview:

![Sci-Kit Learn Flowchart](./img/scikit-flowchart.png)

This chart is actually interactive on the Sci-Kit Learn website, and clicking on each of the nodes brings you to the documentation for each of the particular algorithms. See the 'cheat sheet' here: <https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html> 

Because the focus of this course is not really about the algorithms per se, we will discuss the API that Sci-Kit Learn uses for training algorithms in general. 

The details differ between algorithm types. For example, clustering algorithms are unsupervised, and therefore do not need labels when they are trained. These differences are clear from the documentation.
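
As a small illustration of this difference, the sketch below fits a clustering algorithm without any labels. `KMeans` and the choice of 3 clusters are used purely as an example here and are not part of the course exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load only the features; the labels are deliberately discarded
X_iris, _ = load_iris(return_X_y=True)

# The number of clusters is our own assumption, not learned from labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X_iris)  # note: no y is passed

# Each sample is assigned a cluster id, which is not a class label
cluster_ids = kmeans.labels_
print(cluster_ids[:10])
```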

As a first step, let's take classification as an example so that we can learn about the general look and feel of the API. In this case we will take a look at the Support Vector Machine (SVM) classifier.

Explaining how SVMs work is out of the scope of this seminar, but what we can say is that it is a very widely used algorithm that performs well on a variety of tasks. It can be used for both linear and non-linear data. It can be used for both classification and regression tasks. And it can be used for both binary and multi-class classification.

The general procedure for training any algorithm in SciKit Learn is to initialise an object of the algorithm in question, which will contain all the parameters that are specific to that algorithm.
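
The procedure can be sketched as follows, using the Support Vector Machine classifier on the Iris data. This is a minimal illustration of the initialise, fit, predict pattern. The parameter values and variable names (`svc`, `X_tr`, `X_te`, and so on) are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# 1. Initialise the algorithm object with its algorithm-specific parameters
svc = SVC(kernel='rbf', C=1.0)

# 2. Train it on input/label pairs
svc.fit(X_tr, y_tr)

# 3. Predict labels for unseen inputs
preds = svc.predict(X_te)

# 4. Evaluate: for classifiers, score() returns the mean accuracy
acc = svc.score(X_te, y_te)
print(f"Test accuracy: {acc:.3f}")
```

Most other supervised algorithms in the library can be swapped in by changing only step 1.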

---
# Model Persistence 

Once you have a trained model, you will often want to deploy it, use it in some downstream task, or just save it to disk to be opened later, as perhaps it took many hours to train.

In order to save models, SciKit Learn uses Python's pickle format. This is a serialised binary format. This means that, for example, you can serialise any Python object and save it to disk. Therefore to save a trained model, we just use default Python tools, which SciKit-Learn uses internally anyway. 

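We have a model above from the 5-fold Cross Validation example, stored in the variable `model`. Note that `cross_val_score()` trains copies of the model internally and leaves `model` itself untrained, so we first fit it on the data. We can then save it to disk easily, using Python's `pickle` module.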

This is done as follows:

In [None]:
import pickle

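# cross_val_score() did not fit `model` itself, so train it on the entire dataset
model.fit(X, y)

# Save the trained model; the with-block closes the file when done
with open('model.pickle', 'wb') as f:  # wb = write as binary
    pickle.dump(model, f)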

The model has now been saved to the file `model.pickle`.

This `model.pickle` can now be transferred elsewhere, sent to another party, stored for later use, or just opened in some other application's workflow. We will see this tomorrow, when we will integrate a model in to a web application.

To demonstrate the reading of a model from disk, we can read the model file we just saved:

In [None]:
with open('model.pickle', 'rb') as f:  # rb = read as binary
    model_from_file = pickle.load(f)

Once opened, we can immediately use it to make some predictions on some data:

In [None]:
model_from_file.predict(X[0:5])
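
As a quick sanity check, you can confirm that a model makes identical predictions before and after being pickled. The sketch below does the round trip in memory with `pickle.dumps()` and `pickle.loads()` instead of writing a file. The names `original` and `restored` are illustrative.

```python
import pickle

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)
original = SVC(kernel='linear', C=1, random_state=42).fit(X_iris, y_iris)

# Serialise to bytes and back again, without touching the disk
payload = pickle.dumps(original)
restored = pickle.loads(payload)

# The restored model should behave exactly like the original
same = np.array_equal(original.predict(X_iris), restored.predict(X_iris))
print("Predictions identical after round trip:", same)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.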

# Exercise

Now we will perform an end-to-end machine learning classification. 

In this section, you will run the code, and answer some questions at the end. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import plot_tree
from sklearn.tree import export_text
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV

## Skin Lesion Classification 

In this section we will take a deeper dive into random forests and perform an end-to-end machine learning pipeline, using some of the methods we have learned above. We will then see how to analyse the performance of a model. Your exercise for this section will be to train a random forest model, and after this we will analyse the results and performance of the model together.

Before you train the model, we will first gather a dataset, related to skin lesions. We first look at the dataset's overall structure, the types of data it contains, and we will see how we need to prepare the data, or clean the data. 

The dataset relates to the classification of **erythemato-squamous diseases** (**ESDs**). We will not analyse the images directly. Instead, we will use a dataset containing certain observations or characteristics for 366 skin lesions from patients. The characteristics are visual features that were recorded by a dermatologist. These include, for example, "itching", "scaling", or "erythema", each with a severity score.

- We can use these characteristics to train a decision tree, rather than analysing the images directly.
- This is an alternative to the deep neural networks (convolutional neural networks) commonly used today. To analyse images directly, typically you would use Deep Learning and Neural Networks. We will look at Deep Learning tomorrow.  
- A decision tree uses the characteristics to learn **rules** for diagnosing an image

Random Forests are often preferred in medicine as they are interpretable. You can view the rules that the trees use to make their decisions.

So, once we have prepared the data for training, your task will be to train a random forest classifier. 

Here are some examples of the skin lesions. These are images of psoriasis; however, the dataset consists of a number of different skin diseases:

<img src="./img/psoriasis-edit.jpg" width="1200px"/>

*Image courtesy of Dash, M et al. "A cascaded deep convolution neural network based CADx system for psoriasis lesion segmentation and severity assessment". Applied Soft Computing, 91 (2020).*

What are erythemato-squamous diseases (ESDs)?

>*Erythemato-squamous diseases are common skin diseases. They consist of six different categories: psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, chronic dermatitis and pityriasis rubra pilaris. They all share the clinical features of erythema and scaling with very little differences. Their automatic detection is a challenging problem as they have overlapping signs and symptoms.*
>
>A. M. Elsayad, M. Al-Dhaifallah and A. M. Nassef, "Analysis and Diagnosis of Erythemato-Squamous Diseases Using CHAID Decision Trees". 15th International Multi-Conference on Systems, Signals & Devices (2018):252–262.

There are 6 different types of erythemato-squamous diseases. They have very similar clinical features, and can have overlapping signs and symptoms. Therefore their detection is a challenging problem. Another difficulty in differential diagnosis is that a disease in its early stages may show the characteristics of another disease and only shows its characteristic features in later stages.

Now that we know a little bit about our dataset, the first step is to load the dataset, and preview the first 10 rows:

In [None]:
derma = pd.read_csv('data/dermatology-clinical-only.tsv', sep='\t')
derma.head(10)

This dataset is much cleaner and better prepared than the heart disease dataset we looked at previously. We can more or less leave it as it is, as most of the columns are already ordinally encoded. 

The only exception to this is the age, which we could scale using a scaler, as we did with the heart disease data. However, in this example, we will not do this for the sake of brevity.

The first thing we might want to do is list the dataset's features. We can do this by looking at the column names:

In [None]:
pd.DataFrame(list(derma.columns), columns=["Feature"], index=range(1, (len(derma.columns)+1)))

Here you can see we have 12 features, as well as our diagnosis. 

Each feature is described in the paper relating to the dataset (Güvenir, H. Altay et al. "Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals". Artificial Intelligence in Medicine, 13 3 (1998): 147-65), and can be seen below:

| Clinical Attribute | Value Range | Description |
|-------------------|-------------|-------------|
| Erythema | 0-3 | Redness of the skin due to inflammation |
| Scaling | 0-3 | Peeling or flaking of the outer skin layer |
| Definite Borders | 0-3 | Clarity/distinctness of lesion boundaries |
| Itching | 0-3 | Severity of pruritus (itching sensation) |
| Koebner Phenomenon | 0-3 | Development of lesions at sites of skin trauma |
| Polygonal Papules | 0-3 | Small, angular raised bumps on the skin |
| Follicular Papules | 0-3 | Small bumps centered around hair follicles |
| Oral Mucosal Involvement | 0-3 | Extent of mouth lining affected |
| Knee and Elbow Involvement | 0-3 | Presence of symptoms at these joints |
| Scalp Involvement | 0-3 | Extent of scalp affected |
| Family History | 0-1 | Presence (1) or absence (0) of disease in family |
| Age | Linear | Patient's age in years |

We can see that a scoring system is used for most of the features, with the following scale:

- 0: Feature is absent
- 1: Mild presence
- 2: Moderate presence
- 3: Maximum presence

There are two exceptions:

Family History:
- 0: No family members have had any of these diseases
- 1: At least one family member has had one of these diseases

Age:
- Recorded as the actual numerical age of the patient in years

Note that the clinical features, such as Scaling or Scalp Involvement, are already ordinal and encoded properly.

Let's now take a look at the diagnosis, to see how many different classes there are: 

In [None]:
pd.DataFrame(derma['diagnosis'].value_counts())

So we can see that the classes are not evenly distributed. This can sometimes lead to issues with interpreting classification accuracy. We will learn more about this tomorrow.

We can even plot this quickly, to get a better idea of the distribution of the target classes. The distribution of the target classes is normally one of the first things you will look at if you are given a dataset to analyse:

In [None]:
derma['diagnosis'].value_counts().plot(kind='bar');

As mentioned, these are not evenly distributed. We will need to be careful how we interpret our results later...

### Cleaning the Data

After examining the target class distribution and taking a look at how the dataset is structured, the next thing to do is normally to prepare the data and 'clean' it.

Cleaning data can consist of several things: 

- Examining the data types involved. For example, some columns may be continuous (such as age), some might be categorical/nominal (such as city), and some might be ordinal (such as severity).
- Converting categorical data into numerical data, which many algorithms require.
- Handling missing data: either by removing rows with missing data, or imputing values using the mean or median (e.g. a missing age could be replaced with the mean age of the age column).
- Scaling the data, often to between `0` and `1` or `-1` and `1`. This ensures features are treated equally: for example, age might range from 0-100 while salary might range from 0-1,000,000, and features with a larger scale can end up dominating the learning process. After scaling to between `0` and `1`, all values peak at 1 and the relative differences between samples are kept.
- Standardising data. For example, dates might be inconsistently formatted (month/day vs. day/month).
- Standardising might also include ensuring units are stored in a consistent way, so that volume is always stored in ml and a value like 1.3l is converted to 1300ml.
- High cardinality: you may want to group certain categories. For example, you might have 10 rare diseases that appear only a few times each in the dataset, and it might make sense to simply create one group called 'other' for all 10 of these diseases. Similar categories can also be merged, even if they appear often in the dataset.
- Merging can also be used to balance the dataset. For example, if you had 100 high grade tumour cases, 40 intermediate grade tumour cases and 20 low grade tumour cases, you might balance the dataset by creating two classes: 100 high grade tumours and 60 intermediate/low grade tumours.

Bear in mind, when converting data that we have these main types of categorical data:

- **Ordinal**: the order has meaning. For example, the symptom severity above. We want to give the more severe symptom the largest value, and the values for the various other gradings are in relative order.
- For example: severe is 3, moderate is 2, mild is 1, and no symptoms is 0.
- On the other hand, **nominal** data has no order. For example, colours are not ordered in any way. Therefore, if you were to encode colours numerically, e.g. Blue is 4 and Red is 17, the algorithm might conclude that Red is 'better'/'stronger'/'worth more' than Blue, which is not the case. Therefore, nominal data is often **one hot encoded**.
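
For ordinal data, the encoding can be as simple as an explicit mapping that preserves the order. The sketch below uses made-up severity labels. The dictionary `severity_order` is our own illustrative choice, mirroring the 0 to 3 scale described above.

```python
import pandas as pd

# Hypothetical severity labels as they might appear in a raw dataset
severity = pd.Series(['mild', 'severe', 'none', 'moderate', 'mild'])

# Explicit mapping, so that the numeric order matches the clinical order
severity_order = {'none': 0, 'mild': 1, 'moderate': 2, 'severe': 3}

encoded_severity = severity.map(severity_order)
print(encoded_severity.tolist())  # [1, 3, 0, 2, 1]
```

Writing the mapping by hand avoids relying on alphabetical order, which would rank 'mild' below 'none'.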

As an example of one hot encoding, suppose a dataset has a field for city that we must encode for the algorithm to be able to handle it. Most algorithms will not work with strings, so they need to be encoded numerically somehow. Here is the original dataset:

| Age | City   | Diagnosis |
|-----|--------|-----------|
| 77  | London | Malignant |
| 44  | NYC    | Benign    |
| 24  | Vienna | Benign    |

You wish to encode the City feature, however this is **nominal** and we do not wish to imply some sort of order (even if this existed, we might not know the order). 

Therefore, the values are encoded as vectors: instead of `London = 1`, `NYC = 2`, and `Vienna = 3` we have `London = [1, 0, 0]`, `NYC = [0, 1, 0]` and `Vienna = [0, 0, 1]`. The `1` or `0` merely represents the presence or absence of a value. The dataset would then look like this:

| Age | London | NYC | Vienna | Diagnosis |
|-----|--------|-----|--------|-----------|
| 77  | 1      | 0   | 0      | Malignant |
| 44  | 0      | 1   | 0      | Benign    |
| 24  | 0      | 0   | 1      | Benign    |

Now we have 3 city features, and the presence of a 1 says that this person lives in this city. It says nothing else about the other cities, only that the patient does **not** live there.
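
The tables above can be reproduced with pandas. This is a minimal sketch using `pd.get_dummies()`. Sci-Kit Learn's `OneHotEncoder` is an alternative that is often more convenient inside a training pipeline.

```python
import pandas as pd

# The example dataset from the table above
patients = pd.DataFrame({
    'Age': [77, 44, 24],
    'City': ['London', 'NYC', 'Vienna'],
    'Diagnosis': ['Malignant', 'Benign', 'Benign'],
})

# One column per city, containing 1 where the patient lives there, else 0
city_dummies = pd.get_dummies(patients['City'], dtype=int)

# Put the new columns where the City column used to be
encoded = pd.concat(
    [patients[['Age']], city_dummies, patients[['Diagnosis']]], axis=1
)
print(encoded)
```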

It is also good practice to document any changes you make to the data, as you will normally need to make certain assumptions as you clean the data. Jupyter notebooks serve as a perfect way to document the changes as you can display step by step exactly what you did to the data in order to clean it.

With this dataset, it is relatively easy because all the clinical fields have values between 0-3 and are already ordinal and require no further editing work. The exceptions are the family history which is binary (0 or 1), and age which is continuous. The binary field also does not require work. The age field could be scaled, if required, but we will not do this in order to save time.

Therefore, the only cleaning we really have to do is to find missing values and either fill them in or remove them:

In [None]:
# Convert features to numeric; non-numeric entries such as '?' become NaN
for col in derma.columns:
    if col != 'diagnosis':
        derma[col] = pd.to_numeric(derma[col], errors='coerce')

# Handle missing ages (replace with the median age)
derma['age'] = derma['age'].fillna(derma['age'].median())

# Finally, if we missed any missing data, drop the rows that still contain missing values
derma = derma.dropna()

In [None]:
derma.head()

Once this is done, we will train a Random Forest model.

Random Forests are a powerful and often used model in medicine due to their ability to handle complex data while also maintaining interpretability. This contrasts with many other models which are 'black boxes', such as neural networks, where it is very difficult to work out why they have made a particular classification. In healthcare applications, the feature importance ranking built into Random Forests helps clinicians identify which variables most strongly influence predictions, making them valuable for both diagnostic and prognostic modelling. Furthermore, Random Forests are relatively resistant to overfitting compared to other complex models, making them particularly suitable for medical datasets that are often limited in size.

Run the code to train the Random Forest:

In [None]:
X = derma.drop('diagnosis', axis=1)
y = derma['diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_test)

We do not perform a 5-fold cross validation in order to keep things simple. 

Now that we have trained the model, let's look at the overall accuracy of the model at classifying the lesions:

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {(accuracy * 100):.2f}%")

This is the accuracy across all 6 skin lesion classes, and looks like a good score.

However... relying on only this metric might be problematic for a few reasons, namely that we have an unevenly distributed diagnosis class.

Let's look at how the model performs on each class in the test set. This can be done using the `classification_report()` function, which breaks down performance on a class-by-class basis (precision, recall and F1 score), and also shows how many samples of each class the test set contains:

In [None]:
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

What can be observed here?

First of all, the `support` field shows how many samples from each class are in the test set. Observe that this is not balanced.

For example, notice that psoriasis is over-represented. There are 31 samples out of 74. This is 42% of all the test data. Now imagine the model classified all lesions as psoriasis: it would be 42% accurate overall... Why might this be a problem?
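
This thought experiment can be tried directly with a baseline that always predicts the most frequent class. In the sketch below, only the psoriasis count (31 of 74) and the Pityriasis Rubra Pilaris count (3) come from the report above. The remaining class counts are hypothetical, chosen only so that the total is 74.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# 31 psoriasis and 3 pityriasis rubra pilaris samples as in the text;
# the other four counts are invented for illustration
y_true = np.array(
    ['psoriasis'] * 31
    + ['seborrheic dermatitis'] * 12
    + ['lichen planus'] * 10
    + ['pityriasis rosea'] * 9
    + ['chronic dermatitis'] * 9
    + ['pityriasis rubra pilaris'] * 3
)

# The features are irrelevant for this baseline, so zeros are enough
X_placeholder = np.zeros((len(y_true), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_placeholder, y_true)
y_base = baseline.predict(X_placeholder)

baseline_acc = accuracy_score(y_true, y_base)
print(f"Baseline accuracy: {baseline_acc * 100:.1f}%")
```

The baseline reaches about 42% accuracy without ever detecting any of the other five diseases, which is why a real model's accuracy should be judged against this number and not against 0%.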

Also observe that Pityriasis Rubra Pilaris has perfect scores, however there are only 3 samples so we need to be careful with how we interpret this.

A few other observations, from a clinical point of view:

- For less common conditions like Pityriasis Rubra Pilaris, the sample size is quite small (only 3 cases), so the perfect score should be interpreted cautiously
- The model seems to be most reliable at ruling in Lichen Planus and Psoriasis (high precision)
- It's best at not missing cases of Pityriasis Rosea (perfect recall) but may overdiagnose it
- The model has an overall accuracy of 85%, which is quite good for a multi-class skin condition classifier

Hence, you must carefully analyse the results of a model: printing the overall accuracy is often not enough.

Another good way to analyse and evaluate a model, is through the use of a so-called confusion matrix. 

SciKit-Learn includes a `confusion_matrix()` function:

In [None]:
print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

On its own, this might not look very useful. However if you plot it, with some colour in the form of a heatmap, it is a very useful tool to interpret and evaluate a model's classification performance.

Let's plot the confusion matrix:

In [None]:
# Visualise the confusion matrix
classes = sorted(y.unique())

cm_df = pd.DataFrame(cm, index=classes, columns=classes)

plt.figure(figsize=(10, 8))

# Create heatmap
sns.heatmap(cm_df, annot=True, fmt='d', cmap='Blues',
            square=True, linewidths=0.5, cbar_kws={"shrink": .5})

# Customize the plot
plt.title('Confusion Matrix for Dermatological Diagnoses', fontsize=14, pad=20)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)

# Rotate x-labels for better readability
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

# Adjust layout to prevent label cutoff
plt.tight_layout()

plt.show()

The confusion matrix shows the model's "confusion" at predicting each class.

It plots the true labels versus the predicted labels for each class, and shows how often the predictions were correct and, when incorrect, what the incorrect predictions were.

For example, we can see that psoriasis was correctly predicted 26 times, but incorrectly predicted as pityriasis rosea twice, and seborrheic dermatitis 3 times.
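
The precision and recall values in the classification report can be read straight off a confusion matrix. The sketch below uses a reduced 3-class matrix. The psoriasis row (26, 2, 3) follows the example just given, while the other two rows are hypothetical numbers added only to make the arithmetic complete.

```python
import numpy as np

labels = ['psoriasis', 'pityriasis rosea', 'seborrheic dermatitis']

# Rows are true labels, columns are predicted labels
cm_demo = np.array([
    [26, 2, 3],   # true psoriasis (from the text)
    [0, 9, 1],    # true pityriasis rosea (hypothetical)
    [1, 2, 9],    # true seborrheic dermatitis (hypothetical)
])

correct = np.diag(cm_demo)

# Recall: of all samples that truly belong to a class, how many were found?
recall = correct / cm_demo.sum(axis=1)

# Precision: of all samples predicted as a class, how many were right?
precision = correct / cm_demo.sum(axis=0)

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>22}: precision={p:.2f}, recall={r:.2f}")
```

Dividing each row by its sum, as done for recall here, is also what a row-normalised confusion matrix shows on its diagonal.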

Sometimes it makes sense to normalise the confusion matrix, so that the heatmap is more obvious:

In [None]:
# Visualise the confusion matrix
classes = sorted(y.unique())

cm_df = pd.DataFrame(cm, index=classes, columns=classes)

plt.figure(figsize=(10, 8))

# Create heatmap
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
sns.heatmap(cm_norm, annot=True, fmt='.2f', cmap='Reds', xticklabels=classes, yticklabels=classes,
            square=True, linewidths=0.5, cbar_kws={"shrink": .5})

# Customize the plot
plt.title('Confusion Matrix for Dermatological Diagnoses (Normalised)', fontsize=14, pad=20)
plt.xlabel('Predicted Label', fontsize=12)
plt.ylabel('True Label', fontsize=12)

# Rotate x-labels for better readability
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

# Adjust layout to prevent label cutoff
plt.tight_layout()

plt.show()

This shows the accuracy of the model for each class more clearly, as a proportion of the true samples in each class.

At a quick glance, the darker the diagonal, the better the model's classification accuracy across all classes. 

## Interpreting the Model

In this section we will now try to interpret the Random Forest **model** itself, rather than the results of the model.

### Visualise the Decision Tree

Random Forests, as their name suggests, consist of many Decision Trees, and these trees are rule based classifiers.

The trees learn yes-no rules based on the data, and these rules can be interpreted by visualising the trees. 

A Random Forest can contain many hundreds or thousands of decision trees. When a Random Forest makes a classification, each tree makes a classification independently, and a vote occurs as to what the overall classification is.
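
The vote can be reproduced by hand. Note that Sci-Kit Learn's implementation averages the class probabilities predicted by each tree, and does not count one hard vote per tree. The sketch below trains a small forest on the Iris data (an arbitrary choice, as is the number of trees) and compares the manual average against the forest's own prediction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_iris, y_iris = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=42)
forest.fit(X_iris, y_iris)

# Ask every tree separately for its class probabilities:
# shape is (n_trees, n_samples, n_classes)
tree_probas = np.stack(
    [tree.predict_proba(X_iris) for tree in forest.estimators_]
)

# Average over the trees, then pick the most probable class per sample
mean_proba = tree_probas.mean(axis=0)
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

print("Matches forest.predict():",
      np.array_equal(manual_pred, forest.predict(X_iris)))
```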

Below we will visualise a tree by plotting it. Our model consisted of 100 trees, we will visualise tree `0`:

In [None]:
# Plot the first tree in the forest
plt.figure(figsize=(20,10))
plot_tree(rf.estimators_[0], 
          feature_names=list(X.columns),  # plain list: some versions reject a pandas Index
          class_names=classes,
          filled=True,
          max_depth=3)
plt.show()

As you can see, the rules are basically if-else statements. The top node asks `IF scalp involvement <= 0.5 THEN LEFT ELSE RIGHT` for example.

Data is passed through the tree until it reaches an end node, and this is the classification result.

The tree above is too large to be shown on screen in full, so it is often easier to print the tree as text.

We do so below:

In [None]:
tree_text = export_text(rf.estimators_[0], 
              feature_names=list(X.columns),  # plain list: some versions reject a pandas Index
              class_names=classes)

print(tree_text)

Here all the rules are visible and nothing is truncated.

You can use these tree visualisations to understand the rules that the Random Forest's trees are making when performing a classification, making them somewhat interpretable. 

Because we have access to the rules that each tree contains, feature importances can also be generated from Random Forest models.

### Feature Importance

As we just mentioned, we can also analyse which **features contributed the most to the classifications**.

This might give us useful clinical insight into Erythemato-squamous diseases. For example, if the model consistently showed that knee and elbow involvement was an important feature in diagnosing Erythemato-squamous diseases, and this was not known before, it might provide new insights into the disease that could be further examined.

Run this code to see the top feature importances (they are available in our Random Forest model, `rf` in the `feature_importances_` attribute):

In [None]:
print("\nFeature Importance:")

feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

feature_importance

We can also plot this:

In [None]:
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='feature', data=importances)
plt.title('Feature Importance in Random Forest Model')
plt.tight_layout()
plt.show()

### Grid Search 

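You may have noticed that there are quite a large number of parameters that can be chosen for the random forest algorithm. Rather than choosing them by hand, a grid search trains and evaluates a model for every combination of the parameter values you list, using cross validation, and reports the combination with the best score. Sci-Kit Learn provides this in `GridSearchCV`: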

In [None]:
# Define parameter grid as a dictionary of parameters 
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'max_samples': [0.5, 0.8, 1.0]
}

# Calculate total number of combinations
n_combinations = np.prod([len(v) for v in param_grid.values()])
print(f"Total number of combinations to try: {n_combinations}")

# Initialize GridSearchCV
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=1,
    verbose=1
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
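
To see why a grid search can become expensive, it helps to enumerate the combinations explicitly. The sketch below rebuilds the same grid under a different name (`demo_grid`, so that nothing above is overwritten) and uses Sci-Kit Learn's `ParameterGrid` helper to list every combination.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid searched above
demo_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'max_samples': [0.5, 0.8, 1.0],
}

combinations = list(ParameterGrid(demo_grid))
n_folds = 5

# 3 x 3 x 3 = 27 combinations, each trained once per fold
total_fits = len(combinations) * n_folds

print("Combinations:", len(combinations))
print("Model fits with 5-fold CV:", total_fits)
print("One example combination:", combinations[0])
```

With the default `refit=True`, `GridSearchCV` trains one further model at the end, using the best combination on the whole training set. Adding a fourth parameter with three values would triple the total.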

In [None]:
# Create a DataFrame of all results
results = pd.DataFrame(grid_search.cv_results_)

# Sort results by mean test score
results_sorted = results.sort_values('mean_test_score', ascending=False)

In [None]:
results_sorted.head()

In [None]:
print("\nTop 5 parameter combinations:")
cols_to_show = ['params', 'mean_test_score', 'std_test_score']
results_sorted[cols_to_show].head()

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(results['mean_test_score'], bins=20)
plt.title('Distribution of Cross-Validation Scores')
plt.xlabel('Mean Test Score')
plt.ylabel('Count')
plt.show()

In [None]:
from sklearn.model_selection import learning_curve

def plot_learning_curves(estimator, X, y):
    train_sizes = np.linspace(0.1, 1.0, 10)
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=5, n_jobs=-1, train_sizes=train_sizes,
        scoring='accuracy'
    )
    
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    
    plt.figure(figsize=(10, 6))
    plt.plot(train_sizes, train_mean, label='Training score')
    plt.plot(train_sizes, test_mean, label='Cross-validation score')
    
    plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1)
    plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1)
    
    plt.xlabel('Training Examples')
    plt.ylabel('Score')
    plt.title('Learning Curves for Best Model')
    plt.legend(loc='best')
    plt.grid(True)
    plt.show()

In [None]:
best_model = grid_search.best_estimator_

In [None]:
best_model

In [None]:
plot_learning_curves(best_model, X, y)

What does this show? From the SciKit Learn documentation:

- This can show you if getting more data will help you: if both lines level out and plateau then adding more data probably won't help.
- Conversely, if the lines do not converge, then you probably would benefit from more data.
- If both lines plateau early on (i.e. at a low performance level), then you might be **underfitting**: try a model with more complexity. In terms of Random Forests that means more depth, or more trees.
- If there is a large gap between the training score and the test score (i.e. the training score is always much better) then you are **overfitting**. More data is often a solution to overfitting. Regularisation is another method: it prevents models from becoming too complex, which stops them from memorising the training data exactly. How this is done differs depending on the algorithm. In Random Forests you might want to control the maximum depth of the trees, for example.

Ideally you want both curves high and close together and to converge.

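Parameter importance: for each parameter in the grid, we average the cross-validation score over all runs that share the same value, and take the difference between the best and worst average as a rough measure of how much that parameter affects performance: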

In [None]:
param_importance = pd.DataFrame({
    'Parameter': [],
    'Impact': []
})

In [None]:
# For each parameter, average the CV score over all runs that share a value,
# then measure how much that average varies across the parameter's values
rows = []
for param, values in param_grid.items():
    scores = []
    for value in values:
        mask = results['params'].apply(lambda x: x[param] == value)
        scores.append(results[mask]['mean_test_score'].mean())
    rows.append({'Parameter': param, 'Impact': max(scores) - min(scores)})

# Build the DataFrame in one step rather than concatenating onto an empty one
param_importance = pd.DataFrame(rows)

plt.figure(figsize=(10, 6))
plt.bar(param_importance['Parameter'], param_importance['Impact'])
plt.title('Parameter Impact on Model Performance')
plt.xlabel('Parameter')
plt.ylabel('Impact (Max - Min Score)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## End of Day 1

That was our content for Day 1.

Day 2 will focus on Neural Networks and Deep Learning, as well as model deployment.

Topics include:

- Simple neural networks: we will use PyTorch to define a simple neural network and train it
- PyTorch: we will discuss the framework we will use to create neural networks
- Deep learning: we move to deep networks, the basis for all of the recent advancements, such as generative models and GPT models
- Image classification: we will train an image classifier on a number of small tasks
- Image segmentation: we discuss image segmentation in the context of medicine
- Pre-trained models: use networks that have already been trained
- Fine-tuning models: adapt a pre-trained network to your specific task 
- Model deployment and web application development
- Assignment

---