# Introduction to Machine Learning in Python

Welcome to this introductory session on Machine Learning with Python! Machine learning is a powerful tool that allows computers to learn from data and make predictions or decisions without being explicitly programmed. This technology is at the heart of many modern applications, from recommendation systems and image recognition to financial forecasting and autonomous driving.

In this session, we’ll focus on understanding the basic concepts of machine learning and how to implement them using Python. We’ll start with the fundamental principles, explore essential algorithms, and use Python’s powerful libraries to build and evaluate simple machine-learning models. By the end of this workshop, you will have a foundational understanding of machine learning and the practical skills to start applying these techniques to your own projects.

Here is an outline of what we will talk about today:

1. [Introduction to Machine Learning](#Introduction-to-Machine-Learning)
2. [Data Preparation and Exploration](#Data-Preparation-and-Exploration)
3. [Introduction to Supervised Learning](#Introduction-to-Supervised-Learning-in-Python)
4. [Q&A](#Q&A)

# Introduction to Machine Learning

Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of data.

The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables.

Machine learning is all about results: think of working in a company where your work is judged solely by how good your predictions are. Statistical modeling, by contrast, is more about finding relationships between variables and assessing the significance of those relationships, while also catering for prediction.

![ml_buldozer](img/ml_buldozer.png)

[*Source*](https://xkcd.com/1838/)

## Types of machine learning problems
Not all machine learning problems are created equal, nor do they use the same methods to make predictions. Depending on the nature of the problem, the available data and the researcher's goal, a machine learning problem can fall into one of the following categories:

### Supervised learning

Supervised learning (SL) is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. In supervised learning, each example is a pair consisting of an input and a desired output value. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations.

The most widely used learning algorithms are:

  - Support-vector machines
  - Linear regression
  - Logistic regression
  - Naive Bayes
  - Decision trees
  - K-nearest neighbor algorithm
  - Neural networks (Multilayer perceptron)
  
Supervised learning methods are used for problems like:
  - Classification; assigning a categorical label to each observation
  - Regression; assigning a numeric value to each observation
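
To make the two problem types concrete, here is a minimal sketch on synthetic data. The data, the variable names and the choice of models are assumptions made for illustration; they are not part of the heart disease example used later. A linear regression predicts a numeric value, while a logistic regression predicts a categorical label.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic, hypothetical data: one feature drawn uniformly between 0 and 10
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))

# Regression: the target is a numeric value (a noisy line, y = 3x + 2)
y_numeric = 3.0 * x[:, 0] + 2.0 + rng.normal(scale=0.5, size=200)
reg = LinearRegression().fit(x, y_numeric)

# Classification: the target is a label (1 when the feature exceeds 5, else 0)
y_label = (x[:, 0] > 5).astype(int)
clf = LogisticRegression().fit(x, y_label)

print(reg.coef_[0], reg.intercept_)    # estimates of the slope and intercept
print(clf.predict([[1.0], [9.0]]))     # predicted labels for two new inputs
```

Both models learn from input-output pairs; the only difference is the type of output they are asked to produce.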

### Unsupervised learning

Unsupervised learning (UL) is a type of machine learning in which the algorithm is not provided with any pre-assigned labels or scores for the training data. As a result, unsupervised learning algorithms must first self-discover any naturally occurring patterns in that training data set. Advantages of unsupervised learning include a minimal workload to prepare and audit the training set, in contrast to supervised learning techniques where a considerable amount of expert human labor is required to assign and verify the initial tags, and greater freedom to identify and exploit previously undetected patterns.

The most widely used learning algorithms are:

  - Hierarchical clustering
  - K-means
  - Mixture models

Unsupervised learning methods are used for problems like:
  - Clustering; find the optimal way to split the data into distinct groups
  - Dimensionality reduction; project the data in a lower dimensional space
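
As a rough illustration of both tasks, the sketch below clusters two well-separated synthetic blobs with K-means and then projects the three-dimensional points onto two dimensions with PCA. The data and all variable names are made up for this example, and PCA is just one possible dimensionality reduction method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two hypothetical, well-separated groups of points in 3 dimensions
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([blob_a, blob_b])

# Clustering: no labels are given, K-means has to discover the two groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Dimensionality reduction: project the 3 features onto 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(cluster_ids[:5], cluster_ids[-5:])
print(X_2d.shape)
```

Note that the cluster ids themselves are arbitrary: the algorithm only tells us which points belong together, not what the groups mean.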

### Reinforcement learning
Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize a cumulative reward. Reinforcement learning differs from supervised learning in that it needs neither labelled input/output pairs nor explicit correction of sub-optimal actions. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
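
The exploration/exploitation trade-off can be sketched with an epsilon-greedy agent on a multi-armed bandit. This is a deliberately tiny, hypothetical setup (the function `run_bandit`, the reward means and the value of `epsilon` are all assumptions for illustration), not a full RL algorithm: with probability `epsilon` the agent tries a random action, otherwise it picks the action that currently looks best.

```python
import random

def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy agent on a bandit with Gaussian rewards (illustrative only)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms         # how often each arm was pulled
    estimates = [0.0] * n_arms    # running mean reward per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the mean reward observed for this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return counts, estimates, total_reward

counts, estimates, total_reward = run_bandit([0.2, 0.5, 1.0])
best_arm = max(range(len(counts)), key=lambda a: counts[a])
print(counts, best_arm)
```

Nobody tells the agent which arm is correct; it only sees rewards, and the occasional random pull is what lets it discover the arm with the highest mean.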

# Data Preparation and Exploration

Data preparation is a critical step in any machine learning workflow. The quality and structure of your data directly affect the performance of your model. In this section, we will cover the essential steps for preparing your data for machine learning:

1. Loading datasets in Python.
2. Cleaning data by handling missing values and removing duplicates.
3. Performing Exploratory Data Analysis (EDA) to understand the data.
4. Feature engineering, including feature scaling and encoding categorical variables.

Let's get started with loading a dataset. In this section we will work with the [*heart disease*](https://archive.ics.uci.edu/dataset/45/heart+disease) dataset, where the goal is to predict the presence of heart disease (values >0) or its absence (0).

Let's explore our dataset!



## Loading datasets in Python

Some datasets can be loaded directly, *without the need to download the data to our local machine.* For this example, our dataset is stored on the [UC Irvine ML Repository](https://archive.ics.uci.edu), and we can fetch it with the `ucimlrepo` package:

In [2]:
from ucimlrepo import fetch_ucirepo

heart_disease = fetch_ucirepo(id=45)
dataset = heart_disease.data
dataset.features.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0


An advantage of this method is that the object we get back is richer than a plain dataframe: it also carries metadata about the dataset!

In [3]:
heart_disease.metadata

{'uci_id': 45,
 'name': 'Heart Disease',
 'repository_url': 'https://archive.ics.uci.edu/dataset/45/heart+disease',
 'data_url': 'https://archive.ics.uci.edu/static/public/45/data.csv',
 'abstract': '4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach',
 'area': 'Health and Medicine',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 303,
 'num_features': 13,
 'feature_types': ['Categorical', 'Integer', 'Real'],
 'demographics': ['Age', 'Sex'],
 'target_col': ['num'],
 'index_col': None,
 'has_missing_values': 'yes',
 'missing_values_symbol': 'NaN',
 'year_of_dataset_creation': 1989,
 'last_updated': 'Fri Nov 03 2023',
 'dataset_doi': '10.24432/C52P4X',
 'creators': ['Andras Janosi',
  'William Steinbrunn',
  'Matthias Pfisterer',
  'Robert Detrano'],
 'intro_paper': {'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.',
  'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M.

But let's get to know our dataset a little more:

- age: Age (in years)
- sex: Sex (1 for male, 0 for female)
- cp: Chest pain type:
  - 1: Typical angina
  - 2: Atypical angina
  - 3: Non-anginal pain
  - 4: Asymptomatic
- trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
- chol: Serum cholesterol (in mg/dl)
- fbs: Fasting blood sugar > 120 mg/dl (1=True, 0=False)
- restecg: Resting electrocardiographic results (0: Normal, 1: ST-T wave abnormality, 2: Probable or definite left ventricular hypertrophy)
- thalach: Maximum heart rate achieved
- exang: Exercise induced angina (1=yes, 0=no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: The slope of the peak exercise ST segment:
  - 1: upsloping
  - 2: flat
  - 3: downsloping
- ca: Number of major vessels (0-3) colored by fluoroscopy
- thal:
  - 3: normal
  - 6: fixed defect
  - 7: reversible defect

## Cleaning and handling duplicates

Most ML algorithms do not like missing values! Many of them will break when the features contain missing values, and others will drop the affected observations!

In ML there are two ways you can treat missing values. 

- Drop the rows/columns with the missing observations
- Impute the missing value based on a metric (eg mean, median, ...) or a more complex method (eg KNN, Linear regression, ...)

The choice of method depends on the percentage of missing values you have in your dataset, and also on the importance of the columns containing the missing values! For this tutorial, we will proceed with dropping the rows with missing observations!
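
For comparison, here is a hedged sketch of both options on a tiny made-up dataframe (the values are hypothetical; only the column names are borrowed from our dataset). It uses scikit-learn's `SimpleImputer` with the `most_frequent` strategy, which suits integer-coded categories; `mean` and `median` are other strategies it supports.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "ca": [0.0, 3.0, np.nan, 2.0, 0.0],
    "thal": [6.0, 3.0, 7.0, np.nan, 3.0],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna(axis=0)

# Option 2: fill each missing value with the most frequent value of its column
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(dropped.shape)
print(imputed)
```

Dropping loses two of the five rows here, while imputing keeps all of them at the cost of inventing two values.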

To check whether we have any missing values, we will use the `.isna()` method from pandas! Then we can use the `.dropna()` method to remove the rows with the missing values.

In [4]:
# Isolate the predictors
x = dataset.features

# Check for missing values
x_isna = x.isna()

# isna() returns a dataframe of True/False values, depending on whether each observation is missing or not. To get a sense of what's going on, we need to sum per column
x_isna.sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
dtype: int64

As the number of observations with missing values is small (6 of the 303 rows), we can safely remove them.

In [5]:
# Drop the rows with the missing values
x_cln = x.dropna(axis=0)
x_cln.shape

(297, 13)

We will do the same checks for our target variable as well

In [6]:
y = dataset.targets
# Binarize the target: every value above 0 means that heart disease is present
y = y.replace({2:1, 3:1, 4:1})
n_na_tar = y.isna().sum()
print(f'Number of missing values in the target: {n_na_tar}')

Number of missing values in the target: num    0
dtype: int64


The next step in the process is to check for duplicates. It's not uncommon for datasets to include observations that are repeated. We can check for duplicates using the `.duplicated()` method in pandas.

In [7]:
# Check for duplicates by summing the number of True values, which mark duplicated rows.
n_dups = x_cln.duplicated().sum()
print(f'Number of duplicated rows in a dataset: {n_dups}')

Number of duplicated rows in a dataset: 0


## Exploratory Data Analysis (EDA)

Now that our dataset is free from missing values and duplicates, we can start exploring it, so we can better understand the relationships between the features, and also the relationships between the features and the target.

EDA is a long process, especially when the dataset is new and/or you have no domain-specific information.
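
As a small taste of what a first EDA pass can look like, the sketch below runs a few common pandas summaries on the first five rows printed earlier (typed in by hand, so that the example does not need a download). The variable names are assumptions for illustration; on the real data you would run the same calls on the full dataframe.

```python
import pandas as pd

# The first five rows of a few columns, copied from the table shown earlier
sample = pd.DataFrame({
    "age": [63, 67, 67, 37, 41],
    "sex": [1, 1, 1, 1, 0],
    "cp": [1, 4, 4, 3, 2],
    "chol": [233, 286, 229, 250, 204],
    "thalach": [150, 108, 129, 187, 172],
})

# Summary statistics for the numerical features
summary = sample[["age", "chol", "thalach"]].describe()

# How often each level of a categorical feature appears
cp_counts = sample["cp"].value_counts()

# Linear correlation between two numerical features
corr = sample[["age", "thalach"]].corr()

# A numerical feature summarised per level of a categorical one
mean_by_sex = sample.groupby("sex")["chol"].mean()

print(summary)
print(cp_counts)
print(corr)
print(mean_by_sex)
```

With only five rows these numbers mean very little, but the same four calls are a reasonable starting point on the full dataset.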

## Feature Engineering

Now that our dataset is free from missing values and duplicates, we can start thinking about how to transform our features in a way that can be correctly used by the machine learning algorithms. We will use different techniques for categorical and numerical variables.

## Categorical features

Almost every machine learning algorithm in Python will require some kind of transformation of the categorical features. While there are multiple ways of transforming categorical variables, the most widely used method is called **One-hot encoding**!

In this method, each level of the categorical feature is represented by its own column, with the value 1 if the level is present in that observation and 0 when it's absent

![one_hot_encoding](img/one_hot_encoding.png)

[*Source*](https://www.statology.org/one-hot-encoding-in-python/)

**Question**: What do we do with the binary values? (eg Cat/Dog)?

### One-hot encoding in Python

There are 2 ways to perform one-hot encoding in Python. We can use either:

- the pandas package
- the scikit-learn package

Pandas is easier to use, while scikit-learn is more easily combined with the rest of the ML algorithms (more on that later today!)

#### OHE in Pandas

Pandas has the built-in function `get_dummies()` that performs the OHE operation.



In [9]:
# Perform OHE using pandas.get_dummies()
import pandas as pd

dataset_ohe = pd.get_dummies(x_cln)
dataset_ohe.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')

`pd.get_dummies()` accepts a dataframe with both categorical and numerical variables, but will **only transform the categorical variables!** That is why nothing changed above: all our columns are stored as numbers, so pandas does not treat any of them as categorical. Because data types are tricky in pandas, I'd recommend always using the *columns* argument to specify which columns you want to encode.

The following arguments are worth knowing as well:

- *sparse = False*: Unless you have a huge dataset (which will contain millions of 0's), I'd recommend returning a normal dataframe rather than a sparse one
- *dummy_na* in case you want your NAs to be a label on their own
- *drop_first = True* will be useful in the ML setting
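
A quick sketch on a made-up dataframe shows the difference between the default behaviour and the explicit *columns* argument (the `pet` column and its values are hypothetical). It also hints at what happens with a binary feature: with `drop_first=True`, a two-level variable ends up as a single 0/1 column.

```python
import pandas as pd

toy = pd.DataFrame({
    "cp": [1, 4, 3, 2],                     # integer-coded categorical feature
    "pet": ["cat", "dog", "cat", "dog"],    # string categorical feature
})

# Default: only the string column is encoded, 'cp' is left untouched
auto = pd.get_dummies(toy)

# Explicit: both columns are encoded, and the first level of each is dropped
explicit = pd.get_dummies(toy, columns=["cp", "pet"], drop_first=True, dtype=int)

print(list(auto.columns))      # ['cp', 'pet_cat', 'pet_dog']
print(list(explicit.columns))  # ['cp_2', 'cp_3', 'cp_4', 'pet_dog']
```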

Let's take a subset of our dataframe and perform this *the right-way*

In [10]:
dataset_sub = x_cln.loc[:,["age", "sex", "cp", "slope"]]
# dtype=int returns the dummy columns as 0/1 instead of False/True
pd.get_dummies(dataset_sub, columns=["sex", "cp"], drop_first=True, sparse=False, dtype=int)


Unnamed: 0,age,slope,sex_1,cp_2,cp_3,cp_4
0,63,3,1,0,0,0
1,67,2,1,0,0,1
2,67,2,1,0,0,1
3,37,3,1,0,1,0
4,41,1,0,1,0,0
...,...,...,...,...,...,...
297,57,2,0,0,0,1
298,45,2,1,0,0,0
299,68,2,1,0,0,1
300,57,2,1,0,0,1


#### OHE in Scikit-learn

Scikit-learn has another way to perform OHE, via the `OneHotEncoder` class. All of scikit-learn's operations are implemented in the following way:

- **instantiate**: Create the object
- **fit**: *Train* the object using the data
- **transform**: Transform the data

Similar arguments to have a look at:

- *drop*
- *sparse_output*

In [11]:
from sklearn.preprocessing import OneHotEncoder

# Note: every column passed in is treated as categorical, including age
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(dataset_sub)
ohe.transform(dataset_sub)

# You can combine the two steps in one call using
ohe.fit_transform(dataset_sub)


array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])

You see that the outcome of scikit-learn's version is a `np.array` rather than a `pd.DataFrame`, and that's fine for now

## Numerical variables in Python

Numerical variables are easier to deal with when preparing a dataset for a machine learning project. While many ML algorithms do not require any transformation of the numerical variables, some of them work better after one. The transformation I recommend is **standardization**.

Standardizing a feature means subtracting the mean and dividing by the standard deviation:

$$
\frac{x_i - \bar{x}}{s_x}
$$
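
As a sanity check of the formula, here it is applied by hand with NumPy to the first five ages from the table above. One detail to be aware of: scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`, which is also NumPy's default), whereas the pandas `.std()` method defaults to the sample version (`ddof=1`), so the two can differ slightly on small datasets.

```python
import numpy as np

# First five ages from the table shown earlier
ages = np.array([63, 67, 67, 37, 41], dtype=float)

mean = ages.mean()
std = ages.std()             # population standard deviation (ddof=0)
z = (ages - mean) / std      # the standardized feature

print(z)
print(z.mean(), z.std())     # approximately 0 and 1
```

After standardization the feature has mean 0 and standard deviation 1, whatever its original units were.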

While it's easy to do this transformation using pandas, we will use scikit-learn's implementation in this tutorial

In [12]:
from sklearn.preprocessing import StandardScaler

scl = StandardScaler()
scl.fit(dataset_sub)
scl.transform(dataset_sub)

array([[ 0.93618065,  0.69109474, -2.24062879,  2.26414539],
       [ 1.3789285 ,  0.69109474,  0.87388018,  0.6437811 ],
       [ 1.3789285 ,  0.69109474,  0.87388018,  0.6437811 ],
       ...,
       [ 1.48961547,  0.69109474,  0.87388018,  0.6437811 ],
       [ 0.27205887,  0.69109474,  0.87388018,  0.6437811 ],
       [ 0.27205887, -1.44697961, -1.20245913,  0.6437811 ]])

One good thing about scikit-learn is that all the functionality is applied the same way, regardless of whether it's a preprocessing step or an ML algorithm!

# Introduction to Supervised Learning in Python

Most of the problems that we face in ML fall into the category of supervised learning. This means that we want either to assign a numerical value to an observation (regression) or assign it to one of the pre-defined classes (classification). We do that by having a *label* associated with each observation, from which we can use a model to find the relationships between the label (also known as response or target), and the features.

![supervised_learning](img/supervised_learning.webp)

[*Source*](https://www.kdnuggets.com/understanding-supervised-learning-theory-and-overview)

But how do we make sure that our algorithm performs as expected?

## Train-test split in Python

Train test split is a model validation procedure that allows you to simulate how a model would perform on new/unseen data. Here is how the procedure works:

![train_test_split](img/train_test_split.jpg)

[*Source*](https://builtin.com/data-science/train-test-split)

In Python, we can use the `train_test_split()` function like this:

In [13]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)

print(sorted(x_train.index))
print(sorted(x_test.index))

[2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 17, 19, 22, 23, 24, 25, 27, 28, 29, 30, 32, 33, 34, 35, 36, 37, 39, 40, 41, 42, 43, 45, 46, 47, 48, 50, 51, 52, 53, 55, 56, 57, 58, 60, 61, 62, 64, 65, 68, 69, 70, 71, 72, 73, 74, 75, 76, 78, 79, 80, 82, 83, 84, 87, 88, 89, 90, 91, 92, 93, 94, 96, 97, 98, 99, 101, 102, 103, 104, 105, 106, 107, 110, 111, 113, 114, 115, 116, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 139, 140, 141, 142, 143, 145, 146, 147, 149, 151, 152, 153, 154, 155, 157, 158, 159, 160, 161, 163, 164, 166, 168, 170, 171, 172, 173, 174, 176, 177, 178, 179, 181, 183, 184, 185, 186, 187, 189, 191, 193, 194, 196, 197, 198, 199, 200, 201, 203, 204, 205, 207, 208, 209, 210, 212, 214, 215, 216, 219, 220, 224, 225, 226, 227, 229, 230, 234, 235, 236, 237, 239, 241, 243, 245, 246, 247, 251, 252, 253, 254, 255, 256, 257, 259, 260, 261, 262, 264, 267, 268, 269, 270, 271, 272, 273, 274, 276, 277, 278, 279, 280, 281, 282, 283, 286, 287

A few arguments to keep in mind while using this function:

- *train_size* / *test_size* will determine the proportion of your train or test set (by default, 25% of the data goes to the test set)
- *stratify = y* will make sure that the proportion of labels in the train and test sets is the same (for discrete labels)
- *random_state* to ensure reproducibility
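
The effect of these arguments is easiest to see on a small made-up example (the arrays below are hypothetical). With 25% positive labels overall, a stratified split keeps that proportion in both parts, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 observations, 5 of which (25%) have label 1
X = np.arange(40).reshape(20, 2)
labels = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(y_tr), len(y_te))      # 16 4
print(y_tr.mean(), y_te.mean())  # 0.25 0.25
```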

Now we can safely train on the train set and then assess the performance of our algorithm on the test set! Case closed?

## Cross validation

What if we happen to pick an unusually good (or bad) portion of the data to train on? One idea is to take a few different splits, so that every observation ends up in a test set exactly once:

![kfold_cv](img/kfold_cv.webp)

[*Source*](https://towardsdatascience.com/cross-validation-k-fold-vs-monte-carlo-e54df2fc179b)

In Python, we can create folds like that using the `KFold` class:

In [14]:
from sklearn.model_selection import KFold

cv = KFold(n_splits = 3)
for train, test in cv.split(x, y):
    print(f"Train set: {train}")
    print(f"Test set: {test}")

Train set: [101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118
 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226
 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244
 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262
 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298
 299 300 301 302]
Test set: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38
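
Because the printed folds are long, the same pattern is easier to see on a tiny made-up array of 9 observations: each observation lands in the test set exactly once, and the remaining observations form the matching training set.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: 9 observations with a single feature
X_small = np.arange(9).reshape(9, 1)

folds = [
    (train.tolist(), test.tolist())
    for train, test in KFold(n_splits=3).split(X_small)
]

for train, test in folds:
    print(f"Train set: {train}  Test set: {test}")
# Train set: [3, 4, 5, 6, 7, 8]  Test set: [0, 1, 2]
# Train set: [0, 1, 2, 6, 7, 8]  Test set: [3, 4, 5]
# Train set: [0, 1, 2, 3, 4, 5]  Test set: [6, 7, 8]
```

By default the folds are consecutive blocks; `KFold` also accepts `shuffle=True` (together with `random_state`) if the rows of your dataset are stored in a meaningful order.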

Now it's a good time to put it all together and train our first supervised machine learning model: a logistic regression model!

## Train a ML model

We'll put together all the steps we implemented before. In fact we will:

1. One-hot encode the categorical variables on the training data
2. Standardize the numerical variables on the training data
3. Train the model using the training data
4. Apply the transformations on the test data
5. Test the trained model on the test data



In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn import set_config
set_config(transform_output = "pandas")

# Take a subset of the data
numerical_vars = ["age"]
categorical_vars = ["sex", "cp", "slope"]
x = x[numerical_vars + categorical_vars]

# Split the data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, stratify=y)

# Numerical transformations
scl = StandardScaler()
scl.fit(x_train[numerical_vars])
x_num_tr = scl.transform(x_train[numerical_vars])

# Categorical transformations
ohe = OneHotEncoder(sparse_output=False, drop="first")
ohe.fit(x_train[categorical_vars])
x_cat_tr = ohe.transform(x_train[categorical_vars])

# Merge the categorical and numerical features
x_train = pd.merge(x_num_tr, x_cat_tr, left_index=True, right_index=True)


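# Train the logistic regression model
# (.values.ravel() turns the one-column target dataframe into the 1D array sklearn expects)
logreg = LogisticRegression()
logreg.fit(x_train, y_train.values.ravel())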

# Transform the test data based on the transformation of the train data
x_num_test = scl.transform(x_test[numerical_vars])
x_cat_test = ohe.transform(x_test[categorical_vars])
x_test = pd.merge(x_num_test, x_cat_test, left_index=True, right_index=True)

# Predict using the model on the test data
predicted_probs = logreg.predict_proba(x_test)[:,1]
predictions = [1 if prob > 0.5 else 0 for prob in predicted_probs]



## How to evaluate a ML algorithm

Now that we have the predictions for the test set, we can compare them against the true values (remember, we know the true labels of the test set) using different metrics. Some of them are listed here:

![metrics](img/metrics.png)

[*Source*](https://learnanalyticshere.wordpress.com/2021/02/06/machine-learning-easy-reference/)

and we can use them in Python via the respective functions:

- Accuracy: `sklearn.metrics.accuracy_score`
- Precision: `sklearn.metrics.precision_score`
- Recall: `sklearn.metrics.recall_score`
- F1-score: `sklearn.metrics.f1_score`
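
To see what these functions compute, the sketch below derives the four metrics by hand from the counts of true/false positives and negatives. The helper `classification_metrics` and the two label lists are made up for illustration; in practice you would call the scikit-learn functions listed above.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true_toy, y_pred_toy)
print(metrics)  # every metric equals 0.75 for this example
```

The F1-score is the harmonic mean of precision and recall, so it is only high when both of them are high.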

For this example we will use the F1-score as a metric

In [16]:
from sklearn.metrics import f1_score

f1_test_score = f1_score(y_test, predictions)
f1_test_score

np.float64(0.8813559322033898)

And what about the K-Fold case, where we have multiple test sets? Do not worry, it is as easy as writing a *for loop* over what we did, while being a bit careful!

In [17]:
import numpy as np

all_scores = []

for train_idx, test_idx in cv.split(x,y):

    # Split the data into train and test
    x_train = x.iloc[train_idx]
    x_test = x.iloc[test_idx]
    y_train = y.iloc[train_idx]
    y_test = y.iloc[test_idx]

    # Numerical transformations
    scl = StandardScaler()
    scl.fit(x_train[numerical_vars])
    x_num_tr = scl.transform(x_train[numerical_vars])

    # Categorical transformations
    ohe = OneHotEncoder(sparse_output=False, drop="first")
    ohe.fit(x_train[categorical_vars])
    x_cat_tr = ohe.transform(x_train[categorical_vars])

    # Merge the categorical and numerical features
    x_train = pd.merge(x_num_tr, x_cat_tr, left_index=True, right_index=True)


    # Train the logistic regression model
    # (.values.ravel() turns the one-column target dataframe into a 1D array)
    logreg = LogisticRegression()
    logreg.fit(x_train, y_train.values.ravel())

    # Transform the test data based on the transformation of the train data
    x_num_test = scl.transform(x_test[numerical_vars])
    x_cat_test = ohe.transform(x_test[categorical_vars])
    x_test = pd.merge(x_num_test, x_cat_test, left_index=True, right_index=True)

    # Predict using the model on the test data
    predicted_probs = logreg.predict_proba(x_test)[:,1]
    predictions = [1 if prob > 0.5 else 0 for prob in predicted_probs]
    all_scores.append(f1_score(y_test, predictions))

print(f"The mean F1 score across folds is: {np.mean(all_scores)}")

The mean F1 score across folds is: 0.7759533311162842




## A more interesting case: training a Random Forest model!

We did it! We split the data into train and test sets in more than one way, we trained an ML model from scratch and we assessed its performance using a few metrics.

But do we have to write the exact same loop every time we want to do a cross validation? Let's see a different approach using scikit-learn, and of course, a different model!


### Putting it all together using ColumnTransformer and pipelines

`Pipeline` is a tool for building structured workflows that seamlessly combine data preprocessing, transformation, and model training into a single, streamlined process. A pipeline organizes a series of operations into a defined sequence, ensuring each step is applied consistently and correctly across both training and testing datasets.

Creating a pipeline is very easy, like everything in scikit-learn! The only two components each step needs are an actual transformer (like the ones we have used before) and a name

In [18]:
from sklearn.pipeline import Pipeline

numeric_transformer = Pipeline(
    steps=[
        (
            "scaler", 
            StandardScaler()
        )
    ]
)

categorical_transformer = Pipeline(
    steps=[
        (
            "onehot",
            OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first"),
        )
    ]
)

These transformers are a simple example; you can use any other transformation you want as part of your pipelines. One kind of transformer you can add to these existing pipelines is one that handles missing data. You could possibly have:

- An extra step in the `numeric_transformer` to impute missing values (eg with the median)
- An extra step in the `categorical_transformer` to replace missing values with "Unknown", thus creating an extra column while One-Hot Encoding

And these pipelines are used in the same way as their underlying transformers

In [22]:
numeric_transformer.fit_transform(x_cln[["age"]])

Unnamed: 0,age
0,0.936181
1,1.378929
2,1.378929
3,-1.941680
4,-1.498933
...,...
297,0.272059
298,-1.056185
299,1.489615
300,0.272059


Now that we have set up different pipelines for the different kinds of variables, it's time to combine them into one single transformer! Scikit-learn has us covered on this one, with the `ColumnTransformer` class! In `ColumnTransformer` we dictate which transformers are to be used on which columns, and we give each of them a name (more on why later).

In [25]:
from sklearn.compose import ColumnTransformer

# Combine preprocessing steps using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numerical_vars),
        ("cat", categorical_transformer, categorical_vars),
    ]
)
preprocessor.fit_transform(dataset_sub)

Unnamed: 0,num__age,cat__sex_1,cat__cp_2,cat__cp_3,cat__cp_4,cat__slope_2,cat__slope_3
0,0.936181,1.0,0.0,0.0,0.0,0.0,1.0
1,1.378929,1.0,0.0,0.0,1.0,1.0,0.0
2,1.378929,1.0,0.0,0.0,1.0,1.0,0.0
3,-1.941680,1.0,0.0,1.0,0.0,0.0,1.0
4,-1.498933,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
297,0.272059,0.0,0.0,0.0,1.0,1.0,0.0
298,-1.056185,1.0,0.0,0.0,0.0,1.0,0.0
299,1.489615,1.0,0.0,0.0,1.0,1.0,0.0
300,0.272059,1.0,0.0,0.0,1.0,1.0,0.0


Pipelines are so versatile that they allow us to combine all the transformers we defined with a machine learning model! In this example, we pair them with a [Random Forest](https://en.wikipedia.org/wiki/Random_forest) model.

And a little reminder on how Random Forests work

![random_forest](img/random_forest.webp)

[*Source*](https://medium.com/@denizgunay/random-forest-af5bde5d7e1e)

Let's combine them all together now!

In [28]:
from sklearn.ensemble import RandomForestClassifier

ml_pipeline = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier",RandomForestClassifier())]
)

Now `ml_pipeline` knows everything we want to do with our data! We have declared what to do with the data (and in which columns) and which classifier to use! 

The reason why we did all this work to combine all these pre-processing steps is so we can automate the train/test *for loops* we used before, with the sklearn function `cross_val_score()`! By providing the:

- classifier/regressor (or in our case, the pipeline)
- features
- target
- CV 
- metric to be used

the function will spit out the scores across all the folds! One cool thing is that it auto-magically trains the transformers/classifier on the training sets (as dictated by the cv) and then predicts on the test sets!

But let's see it in action!

In [35]:
from sklearn.model_selection import cross_val_score

scores_cv = cross_val_score(ml_pipeline, x, y.values.ravel(), cv=cv, scoring="f1")
scores_cv

array([0.77272727, 0.71428571, 0.74725275])

That's way better than a long for loop, right?

### How to tune a complex model? 

While logistic regression is a relatively simple model, random forests are not. That means that many decisions need to be made about some non-trainable parameters (called *hyper-parameters*), which are set before the model is trained. These include:

- *max_depth*: The maximum depth of each tree
- *n_estimators*: The number of trees in the forest
- *max_features*: The number (or fraction) of features considered when looking for the best split

But how do we know which are the *optimal* values of these parameters?

The most intuitive way to achieve this is to use all the possible combinations of the hyper-parameters and then choose the model with the highest testing performance! This should add another *for loop* in our operations, right?
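
Written out by hand, such a loop could look like the sketch below. It runs on synthetic data from `make_classification` with a small, made-up grid, and all variable names are assumptions for illustration; it is only meant to show the mechanics of trying every combination.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, hypothetical classification data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

# A small grid of candidate hyper-parameter values
grid = {"max_depth": [2, 3], "n_estimators": [10, 50]}

results = []
for max_depth, n_estimators in product(grid["max_depth"], grid["n_estimators"]):
    model = RandomForestClassifier(
        max_depth=max_depth, n_estimators=n_estimators, random_state=0
    )
    # Mean cross-validated F1 score for this combination
    scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring="f1")
    results.append(((max_depth, n_estimators), scores.mean()))

# Keep the combination with the highest mean score
best_params, best_score = max(results, key=lambda item: item[1])
print(best_params, best_score)
```

Every extra hyper-parameter adds another level to the loop, which is exactly the bookkeeping we would like a library to do for us.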

Sklearn has us covered, with the `GridSearchCV` class! It works the same way as `cross_val_score`, but with the addition of a *grid*: a dictionary of all candidate values for all the *hyper-parameters*.

Let's construct our grid first and then put it all together!

In [42]:
# Define the parameter grid for GridSearchCV
param_grid = {
    'classifier__max_depth': [2,3],
    'classifier__n_estimators': [10, 50, 100],  
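    # A float is a fraction of the features; note that the integer 1 means one feature per split (use 1.0 for all features)
    'classifier__max_features': [0.75, 1]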
}


See the naming convention of the *hyper-parameters*? The prefix specifies which part of the pipeline the hyper-parameter belongs to. This is so that you can try combinations with hyper-parameters that belong to the transformers as well!

You could, for example, compare different imputation methods, or the effect of creating an extra column in OHE for your unknowns vs dropping them! Even though this falls outside the scope of this tutorial, you can combine *hyper-parameters* for every part of your pipeline!
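
If you are unsure what a parameter is called, `get_params()` lists every valid name. The sketch below builds a stand-alone pipeline (named `sketch_pipeline` here, with an imputation step added purely for illustration, so it is not the `ml_pipeline` defined above) and checks that a grid mixing a preprocessing parameter with a classifier parameter only uses valid names.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical pipeline with a named imputation step in the numerical branch
sketch_pipeline = Pipeline(
    steps=[
        (
            "preprocessor",
            ColumnTransformer(
                transformers=[
                    (
                        "num",
                        Pipeline(
                            steps=[
                                ("imputer", SimpleImputer()),
                                ("scaler", StandardScaler()),
                            ]
                        ),
                        ["age"],
                    ),
                    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "cp", "slope"]),
                ]
            ),
        ),
        ("classifier", RandomForestClassifier()),
    ]
)

# Every name that can be used as a key in a parameter grid
valid_names = sketch_pipeline.get_params().keys()

# Step names are chained with double underscores, from the outside in
extended_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 3],
}
missing = [name for name in extended_grid if name not in valid_names]
print(missing)  # an empty list means every key is a valid parameter name
```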

Now let's focus on the grid created for the Random Forests and fit it!

In [45]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(ml_pipeline, param_grid, scoring="f1", refit=True)
grid_search.fit(x,y.values.ravel())

And now that everything has come together, let's explore this `grid_search` object! The key attribute is `.cv_results_`, which summarizes everything the object has captured.

In [47]:
grid_search.cv_results_

{'mean_fit_time': array([0.01315045, 0.02413034, 0.04445066, 0.00683346, 0.02301197,
        0.04257693, 0.00716472, 0.02438636, 0.04537168, 0.00690422,
        0.02337527, 0.04306636]),
 'std_fit_time': array([0.00918085, 0.00046424, 0.00035663, 0.00026913, 0.00031273,
        0.0002024 , 0.00015765, 0.00059123, 0.00038773, 0.00010996,
        0.00019654, 0.00043401]),
 'mean_score_time': array([0.00296817, 0.00317569, 0.00373278, 0.00240221, 0.0030776 ,
        0.00387626, 0.00238824, 0.00305638, 0.00368462, 0.00233698,
        0.00318909, 0.00381503]),
 'std_score_time': array([5.09752228e-04, 3.86037021e-04, 1.79815261e-04, 1.42607207e-04,
        1.60145289e-04, 7.90804489e-05, 1.21369155e-04, 2.53724268e-04,
        1.26439610e-04, 1.79995701e-04, 2.36893401e-04, 2.90767526e-04]),
 'param_classifier__max_depth': masked_array(data=[2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, Fals

But this one seems a bit chaotic; that's why there are a few objects of interest:

- `.best_params_`: The set of the best *hyper-parameters*
- `.best_score_`: The mean cross-validated score of the classifier using the best params
- `.best_estimator_`: As we defined *refit=True*, the best classifier is refitted on the whole dataset using the best *hyper-parameters*
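
Here is a small, self-contained sketch of how these attributes can be read. It runs a grid search on synthetic data with a bare Random Forest (no pipeline), so the data, the grid and the variable names are assumptions for illustration. Wrapping `.cv_results_` in a dataframe is one way to make it less chaotic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, hypothetical classification data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_depth": [2, 3], "n_estimators": [10, 50]},
    scoring="f1",
    cv=3,
    refit=True,
)
search.fit(X_toy, y_toy)

# One row per hyper-parameter combination, best combination first
results_table = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results_table)
print(search.best_params_)
print(search.best_score_)

# The refitted best model can be used directly for predictions
preds = search.best_estimator_.predict(X_toy[:5])
print(preds)
```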

# Q&A

I hope you enjoyed this tutorial on Introduction to Machine Learning using Python! If you have any questions, this is the best time to ask!

Fire away!

In case something comes to your mind later, you can reach me on leosouliotis@gmail.com