# **What is Machine Learning?**

Types of Machine Learning:
* **Supervised**: the *model is trained on labeled data*, meaning the training data is already tagged with the correct output. The algorithm learns to map the input to the output based on this labeled data, and it can use this knowledge to predict the output for new, unseen data.

* **Unsupervised**: the *model is trained on unlabeled data*, meaning the training data is not tagged with the correct output. The algorithm tries to find patterns or groupings in the data based on its own analysis of the input data.

* **Semi-Supervised**: the *model is trained on a mix of labeled and unlabeled data*, typically a small labeled portion alongside a larger unlabeled one.

* **Reinforcement**: the *model learns by receiving feedback from its environment*. The algorithm learns to make decisions by trial and error, with the goal of maximizing rewards and minimizing penalties. This type of learning is often used in robotics, gaming, and autonomous vehicles.
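
To make the difference between the first two types concrete, here is a minimal, library-free sketch. The data, the labels `"small"` and `"large"`, and the helper names (`fit_centroids`, `predict`, `two_means`) are made up for illustration; a real project would normally reach for a library such as Scikit-learn. The two-group loop also assumes the points contain at least two distinct values.

```python
# Supervised: every input comes with the correct label.
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (5.0, "large"), (5.5, "large"), (4.5, "large")]


def fit_centroids(samples):
    """Learn one centroid (mean input value) per label."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


centroids = fit_centroids(labeled)
supervised_prediction = predict(centroids, 4.9)  # new, unseen input

# Unsupervised: the same inputs, but with the labels thrown away.
unlabeled = [x for x, _ in labeled]


def two_means(points, iterations=10):
    """Group 1-D points around two centers without using any labels."""
    lo, hi = min(points), max(points)  # initial guesses for the two centers
    for _ in range(iterations):
        group_lo = [p for p in points if abs(p - lo) <= abs(p - hi)]
        group_hi = [p for p in points if abs(p - lo) > abs(p - hi)]
        lo = sum(group_lo) / len(group_lo)
        hi = sum(group_hi) / len(group_hi)
    return group_lo, group_hi


group_a, group_b = two_means(unlabeled)
print(supervised_prediction)
print(group_a, group_b)
```

The supervised part can name its prediction because it saw labels during training; the unsupervised part only discovers that the points fall into two groups, without knowing what the groups mean.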



# **Main Tools for ML in Python**

## **NumPy**

NumPy (Numerical Python) is a popular Python library used for scientific computing and data analysis. 

It **provides a high-performance multidimensional array object**, various tools for working with these arrays, and a large collection of mathematical functions to work on these arrays. 

NumPy is often used for numerical computations in fields such as data science, machine learning, engineering, and finance.
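
As a small illustration of the array object and a few of the functions that operate on it, the following sketch builds a 2-D array and applies vectorized operations. The values are arbitrary.

```python
import numpy as np

# A 2-D array (matrix) with 2 rows and 3 columns.
a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(a.shape)            # (2, 3)
column_means = a.mean(axis=0)  # mean of each column
scaled = a * 10           # element-wise multiplication, no Python loop needed
product = a @ a.T         # matrix product of a with its transpose, shape (2, 2)

print(column_means)
print(product)
```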

## **Pandas**

Pandas is an open-source data manipulation and analysis library for the Python programming language. 

It **provides data structures for efficiently storing and manipulating data in tabular form**, such as data frames and series. 

Pandas makes it easy to **clean**, **transform**, and **analyze** data, including handling missing values, **grouping** data, and performing **statistical operations**. 

It is a popular tool in data science and is often used in conjunction with other libraries such as NumPy, Matplotlib, and Scikit-learn.
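
The clean, transform, and group steps mentioned above might look like the sketch below. The column names and values are invented for the example, and filling missing values with the column mean is only one of several possible strategies.

```python
import pandas as pd

# Made-up data with one missing temperature reading.
df = pd.DataFrame({
    "city": ["Lima", "Lima", "Quito", "Quito"],
    "temp_c": [19.0, None, 14.0, 16.0],
})

# Clean: replace the missing value with the mean of the known values.
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())

# Transform: derive a new column from an existing one.
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Group and analyze: mean temperature per city.
mean_by_city = df.groupby("city")["temp_c"].mean()
print(mean_by_city)
```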

## **Matplotlib**

Matplotlib is a Python library for creating static, animated, and interactive visualizations. 

It is a popular data visualization library that provides a variety of tools and features for creating graphs, charts, histograms, scatterplots, and more. 

Matplotlib allows users to create high-quality plots with just a few lines of code and supports a wide range of customization options. 

It is widely used in data science, machine learning, and other scientific applications for data visualization and exploratory analysis.
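
A minimal sketch of a line plot is shown here. The output file name `line_plot.png` is an arbitrary choice, and the `Agg` backend is selected only so the script also runs on a machine without a display; in a notebook that line can be dropped so the figure renders inline.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: render to files, not a window
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A minimal line plot")
ax.legend()
fig.savefig("line_plot.png")  # hypothetical output path
```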

## **Scikit-learn**

Scikit-learn is a free and open-source machine learning library for the Python programming language. 

It **provides a wide range of algorithms for supervised and unsupervised learning**, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. 

Scikit-learn also offers a variety of tools for model evaluation and selection, such as cross-validation, hyperparameter tuning, and performance metrics. 

It is widely used in academia and industry for building and deploying machine learning models, and it has a large and active community that contributes to its development and maintenance.
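
The sketch below shows one possible workflow: train a classifier, check it on held-out data, and run cross-validation. The choice of the Iris dataset, logistic regression, a 25% test split, and 5 folds are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for the final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Supervised learning: fit on the training data, score on the held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Model evaluation: 5-fold cross-validation on the training data.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)

print(test_accuracy)
print(cv_scores.mean())
```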

## **Keras**

Keras is a high-level neural networks API. Earlier versions ran on top of TensorFlow, CNTK, or Theano; Keras 3 runs on top of TensorFlow, JAX, or PyTorch. 

It was developed with a focus on fast experimentation and an easy-to-use interface for deep learning models, while still being flexible enough to support complex research. 

With Keras, you can quickly build and train neural networks for classification, regression, and other types of machine learning tasks. 

Keras provides a wide range of pre-built layers, activation functions, loss functions, optimizers, and other components that make it easy to build, train, and deploy neural network models.
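
As a rough sketch of how these components fit together, the example below builds and trains a tiny binary classifier. It assumes the `keras` package is installed together with a backend; the synthetic data, layer sizes, and number of epochs are arbitrary choices for illustration.

```python
import numpy as np
import keras
from keras import layers

# Synthetic data: 200 samples with 4 features; the label is 1 when the
# features sum to a positive number.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).astype("float32")
y = (X.sum(axis=1) > 0).astype("float32")

# Pre-built layers and activation functions stacked into a model.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Optimizer, loss function, and metric are chosen by name.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(X, y, epochs=5, batch_size=32, verbose=0)
predictions = model.predict(X[:3], verbose=0)  # one probability per sample
```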

## **Scikit-learn vs Keras**

Keras and Scikit-learn are two popular machine learning libraries in Python, but they have different focuses and features:

* **Neural Networks vs Traditional ML Algorithms**: Keras is mainly focused on deep learning and neural networks, while Scikit-learn provides a wide range of traditional machine learning algorithms, such as decision trees, support vector machines, and random forests.

* **Level of Abstraction**: Keras is a high-level neural networks API that runs on top of a backend such as TensorFlow (historically also Theano or CNTK). It allows users to quickly build and train neural networks with just a few lines of code, without worrying about the low-level details, but the network architecture still has to be assembled layer by layer. Scikit-learn is also high level: it offers ready-made estimators that share a common `fit`/`predict` interface, so the user mostly picks an algorithm and sets its hyperparameters rather than defining the model's structure.

* **Flexibility vs Ease of Use**: Keras is designed to be easy to use and beginner-friendly, with a simple API and many pre-built models and layers. However, this simplicity comes at the cost of flexibility, as it can be harder to customize and fine-tune models. Scikit-learn, on the other hand, is more flexible and provides more options for customization, but requires more coding and a deeper understanding of the algorithms.

* **Deep Learning vs Machine Learning**: Keras is focused mainly on deep learning and neural networks, which are best suited for complex tasks such as image classification, speech recognition, and natural language processing. Scikit-learn, on the other hand, is focused on traditional machine learning algorithms, which are often a good fit for regression, classification, and clustering on smaller, structured (tabular) datasets.

* **Hardware Requirements**: Keras is designed to work with GPUs and other hardware accelerators, which can significantly speed up the training process for large datasets and complex models. Scikit-learn, on the other hand, can run on standard CPUs and does not require specialized hardware.

In summary, Keras and Scikit-learn are both powerful machine learning libraries, but they have different focuses, features, and trade-offs. Keras is best suited for deep learning tasks and beginners who want an easy-to-use interface, while Scikit-learn is best suited for traditional machine learning tasks and users who want more flexibility and control over their models.

# **Stages of an ML Project**

## **ETL & ELT**

|Extract, Transform and Load (ETL)|Extract, Load and Transform (ELT)|
|:-------:|:-----:|
|TRANSFORM -> LOAD| LOAD -> TRANSFORM|
|Used for on-premises, relational and structured data|Used for scalable cloud structured and unstructured data sources|
|Mainly used for a **small amount of data**|Used for **large amounts of data**|
|Doesn’t provide data lake support|Provides data lake support|
||**Big Data**|
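
The difference in ordering can be sketched with Python's built-in `sqlite3` module standing in for the target data store. The table names, the sample rows, and the functions `etl` and `elt` are hypothetical; real pipelines would use a proper warehouse or data lake.

```python
import sqlite3

# Made-up source records: amounts arrive as untidy strings.
raw_rows = [("alice", " 10 "), ("bob", "25"), ("carol", " 7")]


def etl(rows):
    """ETL: transform in application code, then load the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
    cleaned = [(name, int(amount.strip())) for name, amount in rows]  # transform
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)      # load
    return conn


def elt(rows):
    """ELT: load the raw rows as-is, then transform inside the target store."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)     # load
    conn.execute(                                                     # transform
        "CREATE TABLE sales AS "
        "SELECT name, CAST(TRIM(amount) AS INTEGER) AS amount FROM raw_sales"
    )
    return conn


etl_total = etl(raw_rows).execute("SELECT SUM(amount) FROM sales").fetchone()[0]
elt_total = elt(raw_rows).execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(etl_total, elt_total)
```

Both variants end with the same cleaned `sales` table; the ELT variant additionally keeps the untouched raw data in the store, which is what makes it convenient for large and unstructured sources.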

## **Data Pipeline**

A data pipeline is a series of steps that are used to preprocess and transform raw data into a format that can be used by machine learning algorithms. 

A data pipeline typically involves **collecting** and **cleaning** data, **transforming** the data into a usable format, and **splitting** it into training and test sets. 

Other steps in a data pipeline might include feature engineering, scaling or normalizing the data, and encoding categorical variables. 

The goal of a data pipeline is to **automate the data preprocessing steps** and to ensure that the data is in a consistent and reliable format for use in machine learning models.
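
One way to automate these steps is Scikit-learn's `Pipeline` and `ColumnTransformer`, sketched below. The columns (`age`, `plan`, `churned`) and their values are invented, and mean imputation, standard scaling, and one-hot encoding are just example choices. The split is done without shuffling only to keep this tiny example predictable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up raw data with a missing value and a categorical column.
raw = pd.DataFrame({
    "age": [25.0, 32.0, None, 41.0, 29.0, 37.0],
    "plan": ["free", "pro", "free", "pro", "free", "pro"],
    "churned": [0, 1, 0, 1, 0, 1],
})
X = raw[["age", "plan"]]
y = raw["churned"]

# Split first, so the preprocessing is fitted on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, shuffle=False
)

# Numeric column: fill missing values, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Apply the numeric steps to "age" and one-hot encode "plan".
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

X_train_ready = preprocess.fit_transform(X_train)  # learn statistics and apply
X_test_ready = preprocess.transform(X_test)        # reuse the learned statistics
print(X_train_ready.shape, X_test_ready.shape)
```

Fitting the preprocessing on the training set and only applying it to the test set keeps information from the test data out of training.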

# **Train, Test and Validation**



In machine learning, it is essential to evaluate the performance of the model on data that it has not seen before to measure how well it generalizes to new data. 

For this reason, the data set is typically divided into three parts: training, validation, and test sets.

The **training set** is used to train the model.

The **validation set** is used to *tune the model's hyperparameters and monitor its performance* during training.

The **test set** is the final evaluation set *used to estimate the model's performance* on new data.

-------

**Validation Set**: Used to *OPTIMIZE* model hyperparameters.

**Test Set**: Used to get an unbiased *estimate of the final model performance*.
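
A three-way split can be sketched with the standard library alone. The function name `split_indices` and the 60/20/20 proportions are assumptions for the example; the right proportions depend on how much data is available.

```python
import random


def split_indices(n, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train, validation, and test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_test = int(n * test_fraction)
    n_val = int(n * val_fraction)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

Every sample lands in exactly one of the three sets, so the test set stays unseen until the final evaluation.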

### Hyperparameters

Hyperparameters are parameters that are not learned by the model during training, but are set before training and influence the learning process. 

Hyperparameters can be thought of as configuration settings for the machine learning algorithm, such as the learning rate, the number of hidden layers in a neural network, the number of decision trees in a random forest, the regularization strength, etc. 

These parameters are set by the user or data scientist, based on their knowledge of the problem and experience with similar models. 

The performance of the model depends on the hyperparameters chosen, so tuning them carefully is important to achieve the best results.
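
To illustrate the distinction, the sketch below fits the one-parameter model `y = w * x` with gradient descent. The weight `w` is a parameter learned during training, while the learning rate is a hyperparameter fixed beforehand and then chosen by comparing results on a validation set. The data, candidate learning rates, and function names are made up for the example.

```python
# Training and validation data generated from y = 2x.
train_xs, train_ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
val_xs, val_ys = [5.0, 6.0], [10.0, 12.0]


def train(learning_rate, epochs=50):
    """Learn the weight w by gradient descent on the mean squared error."""
    w = 0.0  # parameter: updated by the training loop
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(train_xs, train_ys)) / len(train_xs)
        w -= learning_rate * grad  # hyperparameter controls the step size
    return w


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Try several learning rates and keep the one with the lowest validation error.
candidates = [0.001, 0.05, 0.2]
val_errors = {lr: mse(train(lr), val_xs, val_ys) for lr in candidates}
best_lr = min(val_errors, key=val_errors.get)
print(best_lr, train(best_lr))
```

With these numbers the smallest rate has not converged after 50 epochs and the largest one diverges, which shows how strongly a single hyperparameter can affect the result.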



# **Metrics**

## Accuracy

### Confusion Matrix

### Cross Validation