# **What is Machine Learning?**

Machine learning is a subset of artificial intelligence (AI) that focuses on developing algorithms and models that allow computer systems to learn from data and make predictions or decisions without being explicitly programmed. 

It is based on the idea that computers can learn patterns and relationships from data and use that knowledge to make informed decisions or predictions.

In machine learning, the goal is to build models that can automatically learn and improve from experience or data. 

In supervised learning, the most common setting, this is done by training the model on a labeled dataset, where the input data is associated with known outputs or labels. 

The model learns from this labeled data to generalize patterns and relationships, and then it can make predictions or decisions on new, unseen data.

Types of Machine Learning:
* **Supervised**: the *model is trained on labeled data*, meaning the training data is already tagged with the correct output. The algorithm learns to map the input to the output based on this labeled data, and it can use this knowledge to predict the output for new, unseen data.

* **Unsupervised**: the *model is trained on unlabeled data*, meaning the training data is not tagged with the correct output. The algorithm tries to find patterns or groupings in the data based on its own analysis of the input data.

* **Semi-Supervised**: the *model is trained on a mix of labeled and unlabeled data*, typically a small labeled portion and a larger unlabeled portion.

* **Reinforcement**: the *model learns by receiving feedback from its environment*. The algorithm learns to make decisions by trial and error, with the goal of maximizing rewards and minimizing penalties. This type of learning is often used in robotics, gaming, and autonomous vehicles.



# **Main Tools for ML in Python**

## **NumPy**

NumPy (Numerical Python) is a popular Python library used for scientific computing and data analysis. 

It **provides a high-performance multidimensional array object**, various tools for working with these arrays, and a large collection of mathematical functions to work on these arrays. 

NumPy is often used for numerical computations in fields such as data science, machine learning, engineering, and finance.
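
As a minimal sketch of the array object and vectorized math functions described above (the values are made up for illustration):

```python
import numpy as np

# A 2-D array with 2 rows and 3 columns; values are illustrative only.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)         # dimensions of the array
print(a.mean(axis=0))  # column-wise mean
print(np.sqrt(a))      # element-wise math, no explicit Python loop
print(a @ a.T)         # matrix product of a with its transpose
```

Operations such as `np.sqrt(a)` act on the whole array at once, which is the main reason NumPy code tends to be shorter and faster than equivalent Python loops.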

## **Pandas**

Pandas is an open-source data manipulation and analysis library for the Python programming language. 

It **provides data structures for efficiently storing and manipulating data in tabular form**, such as data frames and series. 

Pandas makes it easy to **clean**, **transform**, and **analyze** data, including handling missing values, **grouping** data, and performing **statistical operations**. 

It is a popular tool in data science and is often used in conjunction with other libraries such as NumPy, Matplotlib, and Scikit-learn.
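
The cleaning and grouping steps mentioned above might look like the following sketch. The column names and values are hypothetical, and filling with the column mean is just one possible way to handle a missing value.

```python
import pandas as pd

# Hypothetical toy data with one missing temperature.
df = pd.DataFrame({
    "city": ["Lima", "Lima", "Quito", "Quito"],
    "temp_c": [19.0, None, 14.0, 16.0],
})

# Clean: fill the missing temperature with the column mean.
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

# Group and aggregate: mean temperature per city.
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city)
```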

## **Matplotlib**

Matplotlib is a Python library for creating static, animated, and interactive visualizations. 

It is a popular data visualization library that provides a variety of tools and features for creating graphs, charts, histograms, scatterplots, and more. 

Matplotlib allows users to create high-quality plots with just a few lines of code and supports a wide range of customization options. 

It is widely used in data science, machine learning, and other scientific applications for data visualization and exploratory analysis.

## **Seaborn**

Seaborn is a Python data visualization library based on Matplotlib. 

It provides a high-level interface for creating informative and attractive statistical graphics. 

Seaborn is designed to work well with Pandas data frames and arrays, and it includes several built-in themes for creating visually appealing plots. 

Seaborn also includes functions for visualizing univariate and bivariate distributions, linear regression models, time series data, and categorical data, among other types of plots. 

Overall, Seaborn can make the process of creating complex visualizations in Python easier and more streamlined.

## **Scikit-learn**

Scikit-learn is a free and open-source machine learning library for the Python programming language. 

It **provides a wide range of algorithms for supervised and unsupervised learning**, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. 

Scikit-learn also offers a variety of tools for model evaluation and selection, such as cross-validation, hyperparameter tuning, and performance metrics. 

It is widely used in academia and industry for building and deploying machine learning models, and it has a large and active community that contributes to its development and maintenance.
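
A minimal sketch of a typical Scikit-learn workflow, covering preprocessing, training, and evaluation. The choice of the bundled iris dataset, `LogisticRegression`, and the 25% test split are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing: learn the scaling on the training data only.
scaler = StandardScaler().fit(X_train)

# Supervised learning: fit a classifier on the scaled training data.
clf = LogisticRegression(max_iter=1000)
clf.fit(scaler.transform(X_train), y_train)

# Evaluation: score the predictions on the held-out test data.
y_pred = clf.predict(scaler.transform(X_test))
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Most Scikit-learn estimators follow the same `fit` / `predict` (or `transform`) pattern, so swapping in a different algorithm usually changes only one line.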

## **Keras**

Keras is a high-level neural networks API. Early versions ran on top of TensorFlow, CNTK, or Theano; Keras 3 supports TensorFlow, JAX, and PyTorch as backends. 

It was developed with a focus on enabling fast experimentation and an easy-to-use interface for deep learning models, while still being flexible enough to support complex research. 

With Keras, you can quickly build and train neural networks for classification, regression, and other types of machine learning tasks. 

Keras provides a wide range of pre-built layers, activation functions, loss functions, optimizers, and other components that make it easy to build, train, and deploy neural network models.

## **Scikit-learn VS Keras**

Keras and Scikit-learn are two popular machine learning libraries in Python, but they have different focuses and features:

* **Neural Networks vs Traditional ML Algorithms**: Keras is mainly focused on deep learning and neural networks, while Scikit-learn provides a wide range of traditional machine learning algorithms, such as decision trees, support vector machines, and random forests.

* **Level of Abstraction**: Both libraries are high-level. Keras is a neural networks API that runs on top of a backend such as TensorFlow and lets users build and train networks with just a few lines of code, without worrying about low-level tensor operations. Scikit-learn exposes a uniform estimator interface (`fit`, `predict`, `transform`), in which the structure of the model is fixed by the chosen algorithm and configured through its hyperparameters.

* **Flexibility vs Ease of Use**: Keras is designed to be easy to use and beginner-friendly, with a simple API and many pre-built models and layers. However, this simplicity comes at the cost of flexibility, as it can be harder to customize and fine-tune models. Scikit-learn, on the other hand, is more flexible and provides more options for customization, but requires more coding and a deeper understanding of the algorithms.

* **Deep Learning vs Machine Learning**: Keras is focused mainly on deep learning and neural networks, which are best suited for complex tasks on unstructured data such as image classification, speech recognition, and natural language processing. Scikit-learn, on the other hand, is focused on traditional machine learning algorithms, which often work well on structured (tabular) data and smaller datasets for tasks such as regression, classification, and clustering.

* **Hardware Requirements**: Keras is designed to work with GPUs and other hardware accelerators, which can significantly speed up the training process for large datasets and complex models. Scikit-learn, on the other hand, can run on standard CPUs and does not require specialized hardware.

In summary, Keras and Scikit-learn are both powerful machine learning libraries, but they have different focuses, features, and trade-offs. Keras is best suited for deep learning tasks and beginners who want an easy-to-use interface, while Scikit-learn is best suited for traditional machine learning tasks and users who want more flexibility and control over their models.

## **PyTorch**

PyTorch is an open-source deep learning framework originally developed by Facebook's AI Research lab (FAIR, now part of Meta AI). 

It provides a Python interface and serves as a powerful tool for building and training deep learning models. 

PyTorch offers a dynamic computational graph, which allows for more flexible and intuitive model development compared to static graph frameworks.

## **TensorFlow**

TensorFlow is an open-source machine learning framework developed by Google. 

It is designed to facilitate the development and deployment of machine learning models, particularly deep learning models. 

TensorFlow offers a comprehensive set of tools, libraries, and resources for building and training various types of neural networks.

One of the key features of TensorFlow is its computational graph abstraction, which allows users to define and execute complex mathematical operations as a dataflow graph. 

This graph represents the computations as nodes and the data as edges, enabling efficient parallel execution on CPUs or GPUs.

TensorFlow provides a high-level API called Keras, which simplifies the process of building and training neural networks. 

# **Stages of an ML Project**

## **ETL & ELT**

|Extract, Transform and Load (ETL)|Extract, Load and Transform (ELT)|
|:-------:|:-----:|
|TRANSFORM -> LOAD|LOAD -> TRANSFORM|
|Used for on-premises, relational, structured data|Used for scalable cloud sources, structured and unstructured|
|Mainly used for **small amounts of data**|Used for **large amounts of data**|
|Does not provide data lake support|Provides data lake support|
||**Big Data**|
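
The difference in ordering can be sketched in plain Python. The records, the `transform` function, and the list-based "targets" are hypothetical stand-ins for real source systems and storage.

```python
# Hypothetical raw records, as they might arrive from a source system.
raw_rows = [
    {"id": "1", "amount": " 10.50 "},
    {"id": "2", "amount": "7"},
]

def transform(row):
    """Cast the string fields to proper types."""
    return {"id": int(row["id"]), "amount": float(row["amount"].strip())}

# ETL: transform first, then load only the cleaned rows into the target.
etl_target = [transform(row) for row in raw_rows]

# ELT: load the raw rows as-is, then transform inside the target later.
elt_target = list(raw_rows)                                # load
elt_transformed = [transform(row) for row in elt_target]   # transform on demand

print(etl_target == elt_transformed)
```

Both routes end with the same cleaned rows. The difference is that the ELT target also keeps the raw data, so new transformations can be applied later without extracting again.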

## **Data Pipeline**

A data pipeline is a series of steps that are used to preprocess and transform raw data into a format that can be used by machine learning algorithms. 

A data pipeline typically involves **collecting** and **cleaning** data, **transforming** the data into a usable format, and **splitting** it into training and test sets. 

Other steps in a data pipeline might include feature engineering, scaling or normalizing the data, and encoding categorical variables. 

The goal of a data pipeline is to **automate the data preprocessing steps** and to ensure that the data is in a consistent and reliable format for use in machine learning models.
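
A minimal sketch of such a pipeline using only the standard library. The records, column names, and helper functions (`clean`, `encode`, `scale`, `split`) are hypothetical and chosen only to mirror the steps named above.

```python
import random

# Hypothetical raw records; None marks a missing value.
raw = [
    {"size": 50.0, "color": "red", "price": 100.0},
    {"size": None, "color": "blue", "price": 150.0},
    {"size": 70.0, "color": "red", "price": 140.0},
    {"size": 90.0, "color": "green", "price": 200.0},
    {"size": 60.0, "color": "blue", "price": 120.0},
]

def clean(rows):
    """Drop rows that contain a missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def encode(rows, column):
    """Replace a categorical column with an integer code."""
    codes = {value: i for i, value in enumerate(sorted({r[column] for r in rows}))}
    return [{**r, column: codes[r[column]]} for r in rows]

def scale(rows, column):
    """Min-max scale a numeric column to the range [0, 1]."""
    lo = min(r[column] for r in rows)
    hi = max(r[column] for r in rows)
    return [{**r, column: (r[column] - lo) / (hi - lo)} for r in rows]

def split(rows, test_fraction, seed):
    """Shuffle the rows and split them into training and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

prepared = scale(encode(clean(raw), "color"), "size")
train_rows, test_rows = split(prepared, test_fraction=0.25, seed=42)
print(len(train_rows), len(test_rows))
```

This sketch scales before splitting only to follow the order of the steps in the text. In practice, scaling statistics are usually computed on the training split alone so that no information from the test set leaks into training.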

## **Data Lake**

A data lake is a centralized repository of raw and unprocessed data in its native format. 

It is designed to store large volumes of structured, semi-structured, and unstructured data, including text files, images, videos, log files, sensor data, and more. 

Data lakes follow a schema-on-read approach, where data is stored as-is without any predefined schema or structure. 

This allows for flexibility and agility in data exploration and analysis. Data lakes are often used as a staging area for data before it is transformed and loaded into a data warehouse or used for other analytics purposes.

## **Data Warehouse**

A data warehouse is a centralized repository of structured and processed data that is optimized for querying and analysis. 

It is designed to support business intelligence and reporting activities. 

Data warehouses typically follow a schema-on-write approach, where data is preprocessed, transformed, and loaded into a predefined schema before it is stored. 

This structure allows for efficient querying and retrieval of data. 

Data warehouses are typically used to store structured data from various sources, such as transactional databases, and provide a unified view of the data for analysis and reporting purposes.
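
The contrast between schema-on-read (data lake) and schema-on-write (data warehouse) can be illustrated with a small sketch. The payloads, the `WAREHOUSE_SCHEMA` mapping, and the helper functions are hypothetical, and plain lists stand in for real storage systems.

```python
import json

# Hypothetical event payloads in their native (JSON text) format.
incoming = [
    '{"user": "ana", "clicks": "3"}',
    '{"user": "bo", "clicks": 5, "extra": "not part of the warehouse schema"}',
]

# Data lake (schema-on-read): store the payloads untouched.
lake = list(incoming)

def read_clicks(raw_record):
    """Apply a schema only at read time."""
    record = json.loads(raw_record)
    return {"user": str(record["user"]), "clicks": int(record["clicks"])}

# Data warehouse (schema-on-write): shape the data before storing it.
WAREHOUSE_SCHEMA = {"user": str, "clicks": int}

def write_to_warehouse(table, raw_record):
    """Cast each field to the predefined schema, then store the row."""
    record = json.loads(raw_record)
    row = {name: cast(record[name]) for name, cast in WAREHOUSE_SCHEMA.items()}
    table.append(row)

warehouse = []
for payload in incoming:
    write_to_warehouse(warehouse, payload)

print(lake[0])       # still raw text
print(warehouse[0])  # typed row with fixed columns
```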



# **Train, Test and Validation**



In machine learning, it is essential to evaluate the performance of the model on data that it has not seen before to measure how well it generalizes to new data. 

For this reason, the data set is typically divided into three parts: training, validation, and test sets.

The **training set** is used to train the model.

The **validation set** is used to *tune the model's hyperparameters and monitor its performance* during training.

The **test set** is the final evaluation set *used to estimate the model's performance* on new data.

-------
In short:
* **Validation Set**: Used to *TUNE HYPERPARAMETERS AND SELECT THE MODEL*.

* **Test Set**: Used to get an unbiased *estimate of the FINAL MODEL PERFORMANCE*.

### Hyperparameters

Hyperparameters are parameters that are not learned by the model during training, but are *set before training and influence the learning process*. 

Hyperparameters can be thought of as configuration settings for the machine learning algorithm, such as the learning rate, the number of hidden layers in a neural network, the number of decision trees in a random forest, the regularization strength, etc. 

These parameters are set by the user or data scientist, based on their knowledge of the problem and experience with similar models. 

The performance of the model depends on the hyperparameters chosen, so tuning them carefully is important to achieve the best results.
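
As a minimal sketch of hyperparameter tuning, the example below fits a one-weight ridge regression for several values of the regularization strength and keeps the value with the lowest validation error. The data, the candidate values, and the helper names are made up for illustration.

```python
# Toy data, made up for illustration: y is roughly 2 * x.
x_train, y_train = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8]
x_valid, y_valid = [5.0, 6.0], [10.1, 11.9]

def fit_ridge(xs, ys, alpha):
    """Learn the single weight w of y = w * x with L2 penalty alpha."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# alpha is a hyperparameter: it is chosen by us, not learned by fit_ridge.
candidate_alphas = [0.0, 1.0, 10.0, 100.0]
validation_errors = {
    alpha: mse(fit_ridge(x_train, y_train, alpha), x_valid, y_valid)
    for alpha in candidate_alphas
}
best_alpha = min(validation_errors, key=validation_errors.get)
print(best_alpha, validation_errors[best_alpha])
```

The weight `w` is a learned parameter, while `alpha` is set before training. The validation data, not the training data, decides which `alpha` to keep.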



## **Splitting Data**


* **Training set**: Typically, the largest portion of the dataset is allocated to the training set. It is common to use 60-80% of the data for training the model.

* **Testing set**: It is important to have a separate set of data for testing to assess how well the model generalizes. A common practice is to allocate around 20-30% of the data to the testing set.

* **Validation set**: The size of the validation set can vary but is typically around 10-20% of the data.

```python
from sklearn.model_selection import train_test_split
from keras.datasets import mnist as mnist_k

# Load MNIST, which ships already split into TRAINING and TESTING sets
(X_train_full, y_train_full), (X_test, y_test) = mnist_k.load_data()

# Split the full TRAINING set into TRAINING (80%) and VALIDATION (20%) sets
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, test_size=0.2, random_state=42)

print(f"X_train_full: {len(X_train_full)}.")
print(f"y_train_full: {len(y_train_full)}.")
print("")
print(f"X_test: {len(X_test)}.")
print(f"y_test: {len(y_test)}.")
print("")
print(f"X_train: {len(X_train)}.")
print(f"X_valid: {len(X_valid)}.")
print(f"y_train: {len(y_train)}.")
print(f"y_valid: {len(y_valid)}.")
```


```
X_train_full: 60000.
y_train_full: 60000.

X_test: 10000.
y_test: 10000.

X_train: 48000.
X_valid: 12000.
y_train: 48000.
y_valid: 12000.
```


### About Validation

The validation set is typically used during the training process to tune hyperparameters and evaluate the model's performance on unseen data, while the testing set is used to give a final estimate of the model's performance.

The main difference between the two is that the validation set is used repeatedly during development to adjust the model and its hyperparameters, while the testing set is used only once to evaluate the final performance of the model. 

The validation set is like a "practice run" that guides choices such as hyperparameter values and when to stop training, while the testing set is the ultimate test of the model's generalization ability.

The validation set is most useful when it is monitored during training. After each epoch, the model is evaluated on the validation set to check whether it is overfitting or underfitting the training data. 

If the model is overfitting, the training loss keeps decreasing while the validation loss stops improving or starts to rise, and adjustments such as regularization, early stopping, or a simpler model should be made. If both losses remain high, the model is likely underfitting.
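
One common way to act on the validation loss is early stopping. The sketch below uses invented per-epoch losses and a hypothetical `early_stopping_epoch` helper; deep learning libraries provide their own versions of this logic.

```python
# Hypothetical per-epoch losses, invented to show a typical overfitting curve.
train_loss = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
valid_loss = [0.95, 0.70, 0.55, 0.50, 0.49, 0.52, 0.56, 0.61]

def early_stopping_epoch(losses, patience):
    """Return the index of the best epoch, stopping the scan once the loss
    has gone `patience` epochs without improving on the best value."""
    best_epoch, best_loss = 0, losses[0]
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

best_epoch = early_stopping_epoch(valid_loss, patience=2)

# A growing gap between validation and training loss signals overfitting.
final_gap = valid_loss[-1] - train_loss[-1]
print(best_epoch, round(final_gap, 2))
```

Here the training loss falls at every epoch, but the validation loss is lowest at epoch index 4, so training beyond that point only fits the training data more closely.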

# **Metrics**

A metric is a measure used to evaluate the performance of a machine learning model, providing a quantitative measure of how well a model is achieving its goal.

Metrics are used to compare different models or to guide hyperparameter adjustments that improve a model's performance.

Some metrics apply to more than one type of task, but the lists below give a general idea of which metrics are commonly used for each:

Classification Metrics:

* **Accuracy**: The proportion of correct predictions out of the total number of predictions.
* **Precision**: The ability of the model to correctly identify positive instances out of the total instances predicted as positive.
* **Recall**: The ability of the model to correctly identify positive instances out of the total actual positive instances.
* **F1 Score**: The harmonic mean of precision and recall, providing a balanced measure of the model's performance.
* **ROC Curve and AUC**: Receiver Operating Characteristic curve and Area Under the Curve, which assess the model's trade-off between true positive rate and false positive rate.
* **Confusion Matrix**: A table that summarizes the model's predictions against the actual class labels, providing insights into true positives, true negatives, false positives, and false negatives.



Clustering Metrics:

* **Silhouette Score**: A measure of how similar an object is to its own cluster compared to other clusters.
* **Calinski-Harabasz Index**: The ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-separated clusters.
* **Davies-Bouldin Index**: A measure of the average similarity between each cluster and the most similar cluster.
* **Rand Index**: A measure of the similarity between two data clusterings, taking into account true positive, true negative, false positive, and false negative counts.
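
As an illustration of one of these metrics, the Rand Index can be computed by counting the pairs of samples on which two clusterings agree. This is a minimal sketch with made-up cluster assignments, not a replacement for a library implementation.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree: both put
    the pair in the same cluster, or both put it in different clusters."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Made-up cluster assignments for six samples.
clustering_1 = [0, 0, 0, 1, 1, 1]
clustering_2 = [1, 1, 0, 0, 2, 2]
print(rand_index(clustering_1, clustering_2))
```

The cluster names themselves do not matter: two clusterings that group the samples identically score 1.0 even if their labels differ.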

Regression Metrics:

* **Mean Absolute Error (MAE)**: The average absolute difference between the predicted and actual values.
* **Mean Squared Error (MSE)**: The average squared difference between the predicted and actual values.
* **Root Mean Squared Error (RMSE)**: The square root of the average squared difference between the predicted and actual values.
* **R-squared (R2)**: The proportion of the variance in the target variable that can be explained by the model.
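
The four regression metrics above can be computed directly from their definitions. The target and predicted values below are made up for illustration.

```python
import math

# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.5]

n = len(y_true)
errors = [p - t for p, t in zip(y_pred, y_true)]

mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error
mse = sum(e ** 2 for e in errors) / n   # Mean Squared Error
rmse = math.sqrt(mse)                   # Root Mean Squared Error

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)                # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
r2 = 1 - ss_res / ss_tot                            # R-squared

print(mae, mse, rmse, r2)
```

MAE and RMSE are expressed in the same units as the target, while MSE is in squared units and R-squared has no units.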

## **Accuracy**

In a machine learning context, 'accuracy' is a measure of how well a model is able to correctly predict the target variable. It is the ratio of the number of correct predictions to the total number of predictions.

For example, if a classification model is used to predict whether an image shows a cat or a dog, and it makes 100 predictions, out of which 80 are correct, then the accuracy of the model is 80%.

Accuracy is one of the most commonly used evaluation metrics for classification problems. However, it may not be the best metric in all cases, especially when dealing with imbalanced datasets. In such cases, other metrics such as precision, recall, F1-score, and AUC-ROC may provide a better understanding of model performance.
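
The problem with imbalanced datasets can be seen in a small sketch. The labels are hypothetical, and the "model" simply predicts the majority class for every sample.

```python
# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(accuracy, recall)
```

The accuracy looks high even though the model never finds a single positive instance, which is exactly what recall exposes.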

## **Overfitting and Underfitting**


Overfitting and underfitting are concepts related to the performance of a machine learning model, which can be evaluated using various metrics such as accuracy, precision, recall, and F1 score.

**Overfitting** occurs when a model is trained too well on the training data, meaning it learns the noise and idiosyncrasies of the data rather than the underlying patterns, resulting in poor generalization to new data. This can be detected when the model performs very well on the training data but poorly on the test data.

**Underfitting**, on the other hand, occurs when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both the training and test data. This can be detected when the model performs poorly on both the training and test data.

Therefore, overfitting and underfitting are not related to a specific metric but rather to the overall performance of the model on the training and test data.

## **Cross Validation**

Cross-validation is a technique in machine learning that is used to *estimate how well a model will perform on data it was not trained on*. 

It involves partitioning the data into k folds, training the model on k-1 of the folds, and using the remaining fold as the validation set. 

This process is repeated k times, with each fold being used as the validation set exactly once. 

This method provides an estimate of the model's performance that is *less sensitive to the way the data is split*, as it is averaged over multiple splits.

Cross-validation *provides a more reliable estimate of a model's performance on unseen data* than simply using a train-test split.
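
The fold mechanics can be sketched in plain Python. The `k_fold_indices` helper is hypothetical and does no shuffling; library implementations offer shuffling and stratification on top of the same idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every sample lands in the validation set exactly once, and a model would be trained and scored once per fold before averaging the scores.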

**cross_val_predict** and **cross_val_score** are two functions in the scikit-learn library that are commonly used for evaluating the performance of a model.

The **cross_val_score** function computes the cross-validated scores for a given estimator, using a specified evaluation metric. For each fold, it fits the model on the training portion and evaluates it on the held-out portion using the specified scoring metric. The function returns an array of scores, one for each fold of the cross-validation. This function is useful for quickly evaluating the performance of a model on a single metric.

The **cross_val_predict** function, on the other hand, returns the prediction made for each data point while it was in the held-out portion of a fold. It can be useful for generating a set of predictions to be used in an ensemble method, for example. Unlike cross_val_score, cross_val_predict does not return a score, but a prediction for each sample. This makes it useful for inspecting where the model makes its errors.
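
A short sketch of both functions side by side. The iris dataset, `LogisticRegression`, and `cv=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score: one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores, scores.mean())

# cross_val_predict: one out-of-fold prediction per sample.
predictions = cross_val_predict(model, X, y, cv=5)
print(predictions.shape)
```

The array from `cross_val_score` has one entry per fold, while the array from `cross_val_predict` has one entry per sample and can be passed to functions such as a confusion matrix.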

### cross_val_score VS accuracy_score

**accuracy_score** is useful when you have already split your data into training and testing sets, and you want to evaluate the accuracy of your model on the test set. This is a quick and simple way to get an estimate of the model's accuracy.

**cross_val_score**, on the other hand, is useful when you want to perform cross-validation on your data. Cross-validation is a technique that involves dividing the data into k-folds, training the model on k-1 folds, and testing it on the remaining fold. This process is repeated k times, so that each fold is used once as the validation set. The scores from each of the k iterations are then averaged to obtain an estimate of the model's accuracy.

Therefore, which one to choose depends on the context and the resources available. If you have a large dataset or limited computational resources, using accuracy_score on a single held-out test set may be sufficient. If you have a smaller dataset, where a single split can give a noisy estimate, and you can afford to train the model k times, using cross_val_score usually gives a more reliable estimate of the model's accuracy.

## **Confusion Matrix**



A confusion matrix is a table used to evaluate the performance of a classification model. It summarizes the number of correct and incorrect predictions for each class.

For binary classification, the table is composed of four values:

* **True Positives (TP)**: the number of *positive instances correctly predicted as positive*.
* **True Negatives (TN)**: the number of *negative instances correctly predicted as negative*.
* **False Positives (FP)**: the number of *negative instances incorrectly predicted as positive*.
* **False Negatives (FN)**: the number of *positive instances incorrectly predicted as negative*.

These values are arranged in a matrix with the actual values of the classes forming the rows, and the predicted values forming the columns. The diagonal of the matrix represents the correctly classified instances, and the off-diagonal values represent the incorrectly classified instances.

The confusion matrix provides important metrics that help to evaluate the performance of a model, such as precision, recall, and F1-score. It is a useful tool for understanding how well the model is performing and for identifying which classes are being misclassified.
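
A minimal sketch of building a binary confusion matrix by hand and deriving precision, recall, and F1-score from it. The labels are made up, and the layout follows the convention described above, with actual classes as rows and predicted classes as columns.

```python
# Made-up binary labels: 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
confusion = [[tn, fp],
             [fn, tp]]

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(confusion)
print(precision, recall, f1)
```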