<center><img src="https://is1-ssl.mzstatic.com/image/thumb/Purple122/v4/05/e7/67/05e76784-3364-b535-7e20-b3f4946a56b6/AppIcon-0-0-1x_U007emarketing-0-0-0-7-0-0-sRGB-0-0-0-GLES2_U002c0-512MB-85-220-0-0.png/434x0w.webp" style="height:150px"></center>


## <p style="text-align:center"><span style="color:blue">Machine Learning with Scikit-learn - Introduction </span></p>

**Authors** : Saad LABRIJI, Pierre ADEIKALAM

# Introduction

Machine learning is a branch of artificial intelligence that focuses on the development of statistical models to enable computers to learn from and make predictions based on data.

There are different types of machine learning algorithms:

* **Supervised learning**: Learning from **labeled** data to make predictions or decisions. Examples: regression, classification.

* **Self-supervised learning**: Learning from **unlabeled** data using labels that the model generates itself. Examples: Stable Diffusion, generative pre-training (LLMs), etc.

* **Unsupervised learning**: Learning from **unlabeled** data to discover patterns or relationships. Examples: clustering, dimensionality reduction.

* **Reinforcement learning**: Learning **through trial and error** to maximize rewards. Examples: game playing, robotics, reinforcement learning with human feedback, etc.

Scikit-learn is a powerful and user-friendly Python library for classical supervised and unsupervised learning, which are the most common use cases for machine learning in a business setting.

# The supervised machine learning workflow

The first step to any machine learning project is **Data collection and preprocessing**:

* **Gathering data**: The first step in any machine learning project is to obtain relevant data. This can be done through various sources such as databases, APIs, or data scraping. Many companies already struggle at this step because their data is hard to access or spread out across multiple sources, online or offline. Another issue is that you may know what you want to predict but have no target labels, which makes a supervised project unviable.


* **Data exploration**: Once the data is collected, it's crucial to explore and understand its characteristics. This includes examining the structure, size, types of variables, and any missing or inconsistent values. The reason we do this is to make sure the data has something to ***learn from*** and is not pure noise, as ML models cannot detect or invent patterns that are not in the data.


* **Data cleaning**: In this step, we handle missing data, outliers, and any inconsistencies in the dataset. Techniques for data cleaning include missing value imputation, removing outliers, and data transformation.


* **Feature selection**: If the dataset contains a large number of features (columns), we might want to select a subset of the most relevant ones. Simplifying a modelling problem is usually a good idea, as a model with fewer irrelevant inputs tends to generalize better and is easier to maintain in the long term.


* **Feature engineering**: Sometimes, the existing features may not be represented in a way that a machine learning model can understand. For example, string variables need to be encoded numerically before being passed to a machine learning model. Feature engineering involves creating new features derived from the existing ones to better represent the data.

Then, we proceed to **Model training**:

* **Splitting the data**: Before training a model, we need to split the dataset into two parts: the training set and the testing set. The training set is used to train the model, while the testing set is used to evaluate its performance. A common split is 70% for training and 30% for testing, but this can vary depending on the size of the dataset and the specific problem. We need to make sure the testing data is left untouched until the very end so that our estimation of the model's actual performance is accurate.


* **Choosing an algorithm**: Based on the problem type (regression, classification, etc.) and the characteristics of the data, we select an appropriate machine learning algorithm. Each algorithm has its own assumptions and mathematical principles, but in practice we select our models based on their expected performance for a given problem.


* **Training the model**: Using the training set, the selected algorithm learns the underlying patterns in the data. The model adjusts its internal parameters to minimize the difference between the predicted outputs and the actual labels.


* **Model validation**: After training, it is essential to evaluate the model's performance. This can be done by applying it to the testing set and comparing the predicted outputs with the actual labels. Evaluation metrics such as mean squared error, accuracy, precision, recall, and F1 score are used to assess the model's effectiveness.


Finally, we proceed to **Model Evaluation and Deployment**:

* **Evaluation metrics**: We are going to **evaluate the impact the model will have on our business**. For example, if we train a model to predict customer churn, then we will want to estimate how many actual churning customers we will be able to detect once we start using the model and thus **how much money the model is going to make/save**.


* **Deployment**: We will rewrite our machine learning code so that we can ship it and run it in the cloud, in a virtual machine or in another environment specifically built for running ML models (Azure ML, AWS SageMaker, Google Vertex AI, etc.). Deploying ML models is very hard in practice because Data Science code is often not production-ready and because, once a model is in production, its performance can degrade quickly as the data distribution shifts. Deploying and maintaining ML models in production is its own field, separate from Data Science, called **MLOps**, which stands for *Machine Learning Operations*.

We will not cover ML deployment in this notebook, but you can find a lot of free resources online if you want to learn about this subject.

# Part 1: Data Collection and preprocessing


### Load and quick data exploration

We will work on the classic Titanic dataset, where we try to predict whether a passenger survived the tragedy or not.

We provide the dataset in the **`"titanic_train.csv"`** file. 

* Import the `pandas` library under the alias **`pd`**.
* Load the titanic dataset into a `DataFrame` using the **`pd.read_csv()`** function.
* Have a look at the first rows using the **`.head()`** method and try to guess what information each column contains.
* Compute the number of missing values for each column (Use the **`.isna()`** and **`.sum()`** methods).
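
To illustrate the last step without touching the real file, here is a minimal sketch on a made-up `DataFrame`. The `toy` data and its values are assumptions for illustration only; the same two methods apply to the Titanic `DataFrame`.

```py
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV file
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

# .isna() flags every missing cell with True;
# .sum() then counts the True values column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column whose count is 0 contains no missing values.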

The target variable is **`"Survived"`**, does it contain any missing values?

In [None]:
# Insert your code here




In [None]:
# Survived : Survival, 0 = No, 1 = Yes
# Pclass : Ticket class, 1 = 1st, 2 = 2nd, 3 = 3rd
# Sex : Sex, male or female
# Age: Age in years
# SibSp: # of siblings / spouses aboard the Titanic
# Parch: # of parents / children aboard the Titanic
# Ticket: Ticket number
# Fare: Passenger fare
# Cabin: Cabin number
# Embarked: Port of Embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

We see that the dataset has a lot of missing values in the `"Age"` and `"Cabin"` columns, which we will handle later.

Let us now assess if there is anything to learn from in the data. If you saw the Titanic movie, then you know that women and children were evacuated from the ship first, thus they should be more likely to survive.

* Create a new column named **`IsAdult`** which indicates whether the person is an adult or a minor. Use the **`.apply()`** method or a boolean expression.

* Using a **groupby** operation on the columns **`Sex`** and **`IsAdult`**, compute the probability of survival of each group (This can be computed as the **mean** of the **`Survived`** column since it is boolean).
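
The idea that "the mean of a 0/1 column is a probability" can be checked on a small made-up table. This is only a sketch: the `toy` data is invented, and the age cutoff of 18 for adulthood is an assumption.

```py
import pandas as pd

# Hypothetical toy passengers (not taken from the Titanic file)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [34, 29, 8, 12, 51, 40],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Boolean expression: True for adults (the cutoff of 18 is an assumption)
toy["IsAdult"] = toy["Age"] >= 18

# The mean of a 0/1 column is the fraction of ones in each group
survival_rate = toy.groupby(["Sex", "IsAdult"])["Survived"].mean()
print(survival_rate)
```

Note that a comparison such as `Age >= 18` evaluates to `False` when the age is missing, so on real data passengers with an unknown age end up in the "not adult" group unless you handle them separately.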





In [None]:
# Insert your code here




We can see that female adults and children were a lot more likely to survive than male passengers. Also, the probability of survival is slightly different between minors and adults.

Since the survival distributions are so different, sex and age alone tell us that a **machine learning model could use this pattern to make a pretty good prediction** of whether someone survived or not. This tells us that the data carries enough information to train a machine learning model.

### Handling missing data

We started by examining the dataset for any missing values, which are represented as NaN (Not a Number) or other placeholders.

In this case we have two options:

* **Drop missing values**: If the number of missing values is small compared to the size of the dataset, you can choose to drop the rows or columns containing missing values. However if the number of missing values is large in a given column, then you can decide to drop the entire column to save as many rows as possible.


* **Imputation** (Filling missing values): If dropping missing values would result in significant data loss, imputation techniques can be used to fill in the missing values. Common methods include filling empty values with the mean, median, mode, or using more advanced techniques like regression imputation or K-nearest neighbors imputation, which basically trains a separate machine learning model to predict the missing values from the other columns.

Filling missing values with a constant can be done using the **`.fillna()`** method. 

You can decide to drop all rows containing missing values using the **`.dropna()`** method.

You can also decide to drop an entire column using the **`.drop("ColumnName", axis = 1)`** command.
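
The three commands above can be compared on a small made-up table. This is a sketch under assumptions: the `toy` data is invented and the constant `"S"` is an arbitrary fill value.

```py
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in every row
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Embarked": ["S", "S", np.nan],
    "Cabin": [np.nan, np.nan, "C85"],
})

# Option 1: drop every row that has at least one missing value
rows_dropped = toy.dropna()
print(len(rows_dropped))

# Option 2: drop the mostly empty column, then fill what remains
filled = toy.drop("Cabin", axis=1)
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Embarked"] = filled["Embarked"].fillna("S")  # arbitrary constant
print(filled)
```

On this toy table, dropping rows discards all the data, while dropping one column and imputing the rest keeps every row. That trade-off is the reason to look at the missing-value counts before choosing a strategy.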

* Reload the titanic dataset.


* Fill the missing values of the **`"Age"`** column using its **mean**.


* Fill the missing values of the **`"Embarked"`** column using its **mode** (Use the `.mode()[0]` command).


* Drop the **`"Cabin"`** column.


* Verify that the `DataFrame` no longer contains missing values.

In [None]:
# Insert your code here




### Handling non-numerical data

A machine learning model cannot directly consume data which is not numerical. Therefore, we need to transform as much non-numerical data as possible into a numerical representation so that we can feed it to the model.

There are many ways to go about this, but the simplest form of encoding categorical data is **One-Hot encoding**, which consists of replacing each category with a binary variable.

For example, we can replace `male` and `female` by `0` and `1` in the `Sex` column, or replace the `Embarked` column by 3 columns whose value will be 0 or 1 depending on whether the original value is `'S'`, `'C'` or `'Q'`.

Original Data : 

| Sex    | Embarked   |
|:-------|:-----------|
| male   | S          |
| female | S          |
| female | S          |
| male   | C          |
| male   | Q          |

One-hot encoded Data (The `"Sex_male"` column is unnecessary because it is redundant with the `"Sex_female"` column):

|   Sex_female |   Sex_male |   Embarked_C |   Embarked_Q |   Embarked_S |
|-------------:|-----------:|-------------:|-------------:|-------------:|
|            0 |          1 |            0 |            0 |            1 |
|            1 |          0 |            0 |            0 |            1 |
|            1 |          0 |            0 |            0 |            1 |
|            0 |          1 |            1 |            0 |            0 |
|            0 |          1 |            0 |            1 |            0 |

To perform this transformation, we will use the **`OneHotEncoder`** class from **`scikit-learn`**.

This class belongs to a larger family of scikit-learn objects called **transformers**. Every scikit-learn transformer behaves the same way:

```py
# Import the transformer from its module
from sklearn.some_module import SomeTransformer

# Instantiate the transformer object
transformer = SomeTransformer(some_transformation_argument)

# Apply the transformation to the data
transformed_data = transformer.fit_transform(some_data)
```

Examples of scikit-learn transformers: `OneHotEncoder`, `MinMaxScaler`, `LabelEncoder`, `QuantileTransformer`, `SelectKBest`, `PolynomialFeatures`, etc.

First, some cleaning:

* Create a `DataFrame` named **`X`** which contains **every column except `Survived`**. This will be our feature matrix.


* Create a series named **`y`** which contains **only** the column **`Survived`**.


* Remove the columns **`PassengerId`**, **`Name`** and **`Ticket`** from the `DataFrame` **`X`** as we have no use for them.

In [None]:
# Insert your code here




* Import the **`OneHotEncoder`** class from the **`sklearn.preprocessing`** module.


* Instantiate a **`OneHotEncoder`** object with the argument **`sparse_output=False`**.


* Apply the transformation to the columns **`Pclass`**, **`Sex`** and **`Embarked`** of **`X`** and store the result in a variable named **`X_encoded`**.


At the beginning of your code you can insert the following snippet:

```py
from sklearn import set_config
set_config(transform_output = "pandas")
```

This will make sure scikit-learn transformers return a **`pandas`** DataFrame as output and not a **`numpy`** array, which is clunkier to work with for data analysis.

In [None]:
from sklearn import set_config
set_config(transform_output = "pandas")

# Insert your code here




* Drop the columns **`Pclass`**, **`Sex`**, **`Embarked`** from **`X`**.


* **Concatenate** **`X`** and **`X_encoded`** using the command **`pd.concat([X, X_encoded], axis=1)`** and store the result in a `DataFrame` named **`X_clean`**.

In [None]:
# Insert your code here




**Congratulations! Your dataset is ready to be trained on!**


# Part 2: Model Training and Evaluation

Training a supervised machine learning model with scikit-learn is probably the easiest part of this notebook.

The steps are very simple:

```py
# Import the model from a scikit-learn submodule
from sklearn.some_module import SomeModel

# Instantiate the model
model = SomeModel(some_learning_parameters)

# Train the model to predict the target variable y_train from the feature matrix X_train 
model.fit(X_train, y_train)
```
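
To make the template concrete, here is a minimal sketch that follows the same three steps with `LogisticRegression` on a tiny made-up dataset. The data is an assumption for illustration; any scikit-learn classifier could be swapped in.

```py
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, label 1 for the larger values
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0, 0, 1, 1]

# Instantiate the model with its default learning parameters
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)

# A trained model can then predict labels for feature values it is given
predictions = model.predict([[0.0], [3.0]])
print(predictions)
```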

However, we will need to **evaluate** the performance of the model. We cannot evaluate the performance of the model on its training data because of **overfitting**.

Overfitting is **when a model learns a dataset too well such that it cannot apply what it has learned to new data**. If we evaluate the model on its training data, **we might think that it is better than what it really is**. 

A good analogy is imagining a child learning to add numbers. They learn that 2+2 is 4, and 4-1 is 3 (quick maths). However, if then you ask the child to calculate 5+4, then they might not know the result because they did not really understand how addition works even though they know that 2+2 is equal to 4.

A machine learning model that is complex enough can learn any dataset by heart, and sometimes that's a good thing (outlier or anomaly detection is a good use case for that), but most of the time we want the model to learn the underlying **patterns** in the data and not the data itself.
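
Learning "by heart" is easy to reproduce. In the sketch below, the labels are pure random noise, so there is no pattern to learn. The choice of `DecisionTreeClassifier` and the dataset sizes are assumptions made for illustration; the notebook itself does not use this model.

```py
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)  # labels are random: nothing to learn

# Keep the last 50 samples away from the model
X_seen, y_seen = X[:150], y[:150]
X_new, y_new = X[150:], y[150:]

# An unconstrained tree is flexible enough to memorize its training data
tree = DecisionTreeClassifier(random_state=0).fit(X_seen, y_seen)

acc_seen = tree.score(X_seen, y_seen)  # accuracy on memorized samples
acc_new = tree.score(X_new, y_new)     # accuracy on samples never seen
print(acc_seen, acc_new)
```

The score on the memorized samples looks perfect, while the score on the held-out samples shows that the model has learned nothing useful.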

For this reason, we will create a dataset made specifically to measure how well our model performs on unseen data. We call this dataset the **validation** dataset.

To this end, we like to use the convenient **`train_test_split()`** function from scikit-learn:

```py
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.2)
# The `test_size` parameter tells us how much of the data we want to
# save as validation data (In this case we save 20% of the data).
```



This command will split the dataset `(X, y)` into a training dataset `(X_train, y_train)` and a validation dataset `(X_val, y_val)`.

As their names suggest, `(X_train, y_train)` will be used for training while `(X_val, y_val)` will be used for evaluation.
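
A quick way to check what the function returns is to look at the shapes of the four outputs. The sketch below uses made-up arrays; the `random_state` argument is optional and only makes the random split reproducible.

```py
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 80% of the rows go to training, 20% to validation
print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)
```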

As an exercise, do the following:

* Import the **`train_test_split`** function from the **`sklearn.model_selection`** module.


* Import the **`LogisticRegression`** class from the **`sklearn.linear_model`** module.


* Import the **`KNeighborsClassifier`** class from the **`sklearn.neighbors`** module.


* Split the dataset **`(X_clean, y)`** into a training dataset `(X_train, y_train)` and a validation dataset `(X_val, y_val)` such that `(X_val, y_val)` **contains 15% of the dataset**.


* Instantiate and store a **`LogisticRegression`** object and a **`KNeighborsClassifier`** object without specifying any arguments.


* Train both models on `(X_train, y_train)`.

In [None]:
# Insert your code here




In order to evaluate the models, we must choose a metric to measure. 


Since this is a **classification** problem, an easy metric to use is the **accuracy**, which is simply the number of correct predictions divided by the total number of predictions.

Another classical metric that we use for classification is the **recall**, which in this case is the number of survivors that the model correctly classified as such divided by the total number of survivors (we only care about detecting survivors).

Again, scikit-learn comes with several metrics which all behave the same way, making it easy to evaluate models:

```py
from sklearn.metrics import some_metric

# Get the model's predictions on the validation data
y_pred = model.predict(X_val)

# Compute the metric. The true labels (y_val) are always the first argument.
some_metric(y_val, y_pred)
```
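
Here is a small worked example of both metrics on hand-written labels, so that the numbers can be checked by counting. The two label lists are made up for illustration.

```py
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical true labels and predictions (1 = survived, 0 = did not)
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

# 5 of the 8 predictions match the true labels
accuracy = accuracy_score(y_val, y_pred)

# 2 of the 4 actual survivors were predicted as survivors
recall = recall_score(y_val, y_pred)

print(accuracy, recall)  # 0.625 0.5
```

The two metrics answer different questions, which is why a model can rank first on one and second on the other.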

* Import the **`accuracy_score`** and **`recall_score`** metrics from the **`sklearn.metrics`** module.


* Retrieve both models' predictions on the validation data.


* Compute and display the accuracy and recall that you obtained with each model.

In [None]:
# Insert your code here




You will likely see that each model performs best on a different metric (the exact numbers depend on the random split). Therefore, the next step would be to decide which metric we care most about and which model would have the most positive business impact, which is a big part of a Data Scientist's job. However, to keep this notebook short, we will not go into further details.

Congratulations on training and evaluating your first supervised machine learning models!

There is still a lot to learn to be a proper ML developer, but for the Hackathon it should be enough. If you are interested, here are a few things worth learning about for now:

* Model Hyperparameters.
* Cross-Validation.
* Hyperparameter optimization.
* Scikit-learn Pipelines.
* Ensemble and boosting models.

# Part 3: Clustering

Clustering is an **unsupervised** machine learning technique that involves **grouping similar samples** together based on their characteristics or patterns in the data.

Clustering is used mostly for data analysis but some of its techniques can also be used for recommendation engines, for image segmentation or for anomaly detection for example.

The basic clustering workflow is the following:

* Define features you want your clustering model to use to detect **similar** samples.
* Perform clustering using a library such as scikit-learn.
* Analyse the results of the clustering to understand how the clustering was performed. This will tell you the similarities between your samples inside a given cluster and how samples in two different clusters differ.

Let us take a simple example. Imagine you are an e-commerce website and you have the following data about your customers:

* Sex
* Age
* Average Order Value
* Average Browsing Time
* Average Monthly Browsing Sessions
* Preferred Item Category

If you perform clustering on this dataset, you might see that people of different genders and ages have drastically different purchasing patterns. For example, after clustering you may notice that you have a cluster of females aged 15-25 who spend much more time browsing than average before making a purchase.

Using this information, your business can try to reduce the browsing time before purchasing of the customers in this specific cluster by sending them a discount code that is valid for a very limited time. This would be a very effective use of your marketing budget as you only send the discount code to people you are actively targeting.

### K-Means clustering

The K-Means algorithm is a **partition-based** clustering method that aims to divide the data into **`K`** **non-overlapping clusters**. 

Each cluster is initialized with a **centroid**, which represents the "average sample" inside a cluster.

The algorithm then iteratively assigns each data point to the nearest centroid and updates each centroid to the mean of its assigned points. Repeating these two steps **minimizes the variance inside each cluster**, which is equivalent to **maximizing the variance between clusters**.

<img src="https://miro.medium.com/v2/resize:fit:1400/1*b2sO2f--yfZiJazc5rYSpg.gif" width="500px">

To assign a new sample to a cluster, you simply look for the centroid closest to the sample and assign its cluster to the sample.
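
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function names, the random initialization from existing samples, and the fixed number of iterations are all assumptions.

```py
import numpy as np

def assign_to_nearest(X, centroids):
    """Return, for each sample, the index of its closest centroid."""
    # distances[i, j] = Euclidean distance from sample i to centroid j
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct samples picked at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 1: assign every sample to its nearest centroid
        labels = assign_to_nearest(X, centroids)
        # Step 2: move each centroid to the mean of its samples
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, assign_to_nearest(X, centroids)

# Hypothetical toy data: two well-separated groups of four points
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

centroids, labels = kmeans_sketch(X_toy, k=2)
print(centroids)

# A new sample receives the cluster of its closest centroid
new_label = assign_to_nearest(np.array([[9.0, 9.0]]), centroids)[0]
print(new_label)
```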

This algorithm is one of the simplest forms of clustering that exists, which makes it fast and easy to interpret.

In order to use the K-Means algorithm (or any other clustering algorithm scikit-learn offers), you simply have to do the following:

```py
from sklearn.cluster import KMeans

# Instantiate the KMeans object with 2 centroids.
kmeans = KMeans(n_clusters=2)

# Train the clustering algorithm
kmeans.fit(X)
```
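
Once fitted, the model exposes its results through a few attributes and methods. The sketch below uses a made-up dataset with two obvious groups; the `n_init` and `random_state` values are arbitrary choices made for reproducibility.

```py
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of four points
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index of each training sample
print(kmeans.cluster_centers_)  # one row per centroid

# Cluster of a new sample: the one whose centroid is closest
print(kmeans.predict(np.array([[9.0, 9.0]])))
```

The cluster indices themselves are arbitrary: which group is called 0 and which is called 1 can change from one run to another.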


Let us experiment with KMeans on toy datasets.

* Run the following cell to instantiate the toy datasets and visualize them. Expected clusters will be shown as different colors.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import make_circles, make_blobs, make_moons

plt.subplot(1, 3, 1)
# Generate the dataset and plot it
X_circ, y_circ = make_circles(n_samples = 1000, noise = 0.1, factor = 0.3, random_state = 2)
plt.scatter(X_circ[:, 0], X_circ[:, 1], c = y_circ)

# Compute the expected centroids
df = pd.concat([pd.DataFrame(X_circ), pd.DataFrame(y_circ, columns = ["cluster"])], axis = 1)
centroids = df.groupby("cluster").mean()
plt.scatter(centroids[0], centroids[1], c = "red", s = 100)

plt.axis("off")
plt.title("Circles")

plt.subplot(1, 3, 2)
X_blob, y_blob = make_blobs(n_samples = 1000, random_state=31)
plt.scatter(X_blob[:, 0], X_blob[:, 1], c = y_blob)

# Compute the expected centroids
df = pd.concat([pd.DataFrame(X_blob), pd.DataFrame(y_blob, columns = ["cluster"])], axis = 1)
centroids = df.groupby("cluster").mean()
plt.scatter(centroids[0], centroids[1], c = "red", s = 100)

plt.axis("off")
plt.title("Blobs")

plt.subplot(1, 3, 3)

X_moons, y_moons = make_moons(n_samples = 1000, noise = 0.1, random_state = 4)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c = y_moons)

df = pd.concat([pd.DataFrame(X_moons), pd.DataFrame(y_moons, columns = ["cluster"])], axis = 1)
centroids = df.groupby("cluster").mean()
plt.scatter(centroids[0], centroids[1], c = "red", s = 100)

plt.axis("off")
plt.title("Moons")


fig = plt.gcf()
fig.set_size_inches(16, 4)

We created three datasets, one for each case:
* `X_circ` : Circles dataset.
* `X_blob`: Blobs dataset.
* `X_moons` : Moons dataset.

For each dataset, we will train a KMeans model and display the resulting clusters with its centroids.

In [None]:
from sklearn.cluster import KMeans

# Instantiate and train the clustering models
km_circ = KMeans(n_clusters=2, n_init="auto").fit(X_circ)
km_blob = KMeans(n_clusters=3, n_init="auto").fit(X_blob)
km_moons = KMeans(n_clusters=2, n_init="auto").fit(X_moons)

# Make the clustering predictions
y_pred_circ = km_circ.predict(X_circ)
y_pred_blob = km_blob.predict(X_blob)
y_pred_moons = km_moons.predict(X_moons)

# Plot the predicted clusters
plt.subplot(1, 3, 1)
plt.scatter(X_circ[:, 0], X_circ[:, 1], c = y_pred_circ)
plt.scatter(km_circ.cluster_centers_[:, 0], km_circ.cluster_centers_[:, 1], c = "red", s = 100)
plt.axis("off")
plt.title("Circles")

plt.subplot(1, 3, 2)
plt.scatter(X_blob[:, 0], X_blob[:, 1], c = y_pred_blob)
plt.scatter(km_blob.cluster_centers_[:, 0], km_blob.cluster_centers_[:, 1], c = "red", s = 100)
plt.axis("off")
plt.title("Blobs")

plt.subplot(1, 3, 3)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c = y_pred_moons)
plt.scatter(km_moons.cluster_centers_[:, 0], km_moons.cluster_centers_[:, 1], c = "red", s = 100)
plt.axis("off")
plt.title("Moons")

fig = plt.gcf()
fig.set_size_inches(16, 4)

As you can see, the model did not perform well on the Circles or Moons datasets. This is because KMeans uses distance as a similarity measure and groups points that are close to each other in terms of Euclidean distance. For this reason, clusters that do not behave like blobs (i.e. non-convex clusters) will not be properly detected by KMeans.
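
When the expected clusters are known, as with toy datasets, the visual impression can be backed by a number. The sketch below uses the adjusted Rand index from `sklearn.metrics`, a metric this notebook does not otherwise cover: it is close to 1 when two groupings agree and close to 0 when they agree no better than chance. The datasets are regenerated with parameters chosen for illustration, so they are not the ones plotted above.

```py
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_circles
from sklearn.metrics import adjusted_rand_score

# Hypothetical datasets: three distant blobs and two concentric circles
X_b, y_b = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=1.0, random_state=0)
X_c, y_c = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)

labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_b)
labels_c = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_c)

# Compare the found clusters with the expected ones
ari_blobs = adjusted_rand_score(y_b, labels_b)
ari_circles = adjusted_rand_score(y_c, labels_c)
print(ari_blobs, ari_circles)
```

Such a comparison is only possible here because the expected clusters are known, which is rarely the case with real data.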

Let us try a fancier technique, named Spectral Clustering.

In [None]:
from sklearn.cluster import SpectralClustering
import warnings
warnings.filterwarnings("ignore")

# Instantiate and train the clustering models
sc_circ = SpectralClustering(n_clusters=2, affinity = "nearest_neighbors", random_state = 1).fit(X_circ)
sc_blob = SpectralClustering(n_clusters=3, affinity = "nearest_neighbors", random_state = 1).fit(X_blob)
sc_moons = SpectralClustering(n_clusters=2, affinity = "nearest_neighbors", random_state = 1).fit(X_moons)

# Spectral clustering does not have a "predict" method, we need to 
# retrieve the labels manually using the labels_ attribute
y_pred_circ = sc_circ.labels_.astype(int)
y_pred_blob = sc_blob.labels_.astype(int)
y_pred_moons = sc_moons.labels_.astype(int)

# Plot the predicted clusters
plt.subplot(1, 3, 1)
plt.scatter(X_circ[:, 0], X_circ[:, 1], c = y_pred_circ)

# Compute the centroids using the labels
df = pd.concat([pd.DataFrame(X_circ), pd.DataFrame(sc_circ.labels_, columns = ["cluster"])], axis = 1)
centroids = df.groupby("cluster").mean()
plt.scatter(centroids[0], centroids[1], c = 'red', s = 100)

plt.axis("off")
plt.title("Circles")

plt.subplot(1, 3, 2)
plt.scatter(X_blob[:, 0], X_blob[:, 1], c = y_pred_blob)

# Compute the centroids using the labels
df = pd.concat([pd.DataFrame(X_blob), pd.DataFrame(sc_blob.labels_, columns = ["cluster"])], axis = 1)
centroids = df.groupby("cluster").mean()
plt.scatter(centroids[0], centroids[1], c = 'red', s = 100)

plt.axis("off")
plt.title("Blobs")

plt.subplot(1, 3, 3)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c = y_pred_moons)

# Compute the centroids using the labels
df = pd.concat([pd.DataFrame(X_moons), pd.DataFrame(sc_moons.labels_, columns = ["cluster"])], axis = 1)
centroids = df.groupby("cluster").mean()
plt.scatter(centroids[0], centroids[1], c = 'red', s = 100)

plt.axis("off")
plt.title("Moons")

fig = plt.gcf()
fig.set_size_inches(16, 4)

Spectral Clustering seems to be working! Well, not really: we cheated a little by knowing in advance which parameters work best for these toy datasets.

The truth is that it is next to impossible to know in advance which clustering technique is best, especially when working with high-dimensional data. This is why, unless we have a very specific use case in mind, it is hard to evaluate and choose a clustering model.

Congratulations on finishing this notebook! We hope you were able to learn something from it and that it will be helpful for the Hackathon!

**Good luck!**