Types of Machine Learning Tasks

## Learning Paradigms and Machine Learning Tasks

Machine learning tasks can be categorized by the learning paradigm and by the specific goal they pursue. The main paradigms are supervised, unsupervised, semi-supervised, self-supervised, and reinforcement learning. Here is a breakdown of the main categories and some common tasks within them:

* Supervised Learning: In supervised learning, the algorithm learns from labeled data, meaning each data point is associated with a known output or target variable. The goal is to learn a mapping function that can predict the output for new, unseen input data.   

   * Classification: The goal is to assign data points to predefined categories or classes. The output variable is discrete.
      * Binary Classification: Predicting one of two classes (e.g., spam/not spam, cat/dog).   
      * Multi-class Classification: Predicting one of more than two classes (e.g., identifying different types of flowers, classifying news articles into topics).   
  
   * Regression: The goal is to predict a continuous numerical value. The output variable is continuous.
      * Linear Regression: Predicting a value based on a linear relationship with the input features (e.g., predicting house prices based on size).
      * Polynomial Regression: Predicting a value based on a polynomial relationship with the input features.   
      * Time Series Forecasting: Predicting future values in a sequence based on past values (e.g., predicting stock prices, weather forecasting).   
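
As a rough illustration of the regression case above, the sketch below fits a one-feature linear model with ordinary least squares in plain Python. The house sizes and prices are made-up numbers, and `fit_line` and `predict` are hypothetical helper names chosen for this example, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up labeled data: size in square metres -> price in thousands.
sizes = [50, 70, 90, 110]
prices = [150, 210, 270, 330]  # constructed so that price = 3 * size

model = fit_line(sizes, prices)
print(model)
print(predict(model, 100))
```

Because the toy labels were built from an exact linear rule, the fitted slope and intercept recover that rule; real data would leave residual error.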
  
* Unsupervised Learning: In unsupervised learning, the algorithm learns from unlabeled data, without any explicit output or target variable. The goal is to discover hidden patterns, structures, or relationships in the data.   

   * Clustering: Grouping similar data points together based on their features without prior knowledge of the groups (e.g., customer segmentation, document analysis).
      * K-Means Clustering: Partitioning data into k clusters based on distance to centroids.   
      * Hierarchical Clustering: Creating a tree-like structure of clusters.   
      * Density-Based Clustering: Identifying clusters based on the density of data points.   
  
   * Dimensionality Reduction: Reducing the number of features in a dataset while preserving its essential information (e.g., data visualization, feature extraction).
      * Principal Component Analysis (PCA): Finding the principal components that capture the most variance in the data.   
      * t-distributed Stochastic Neighbor Embedding (t-SNE): Reducing dimensionality for visualizing high-dimensional data.   
  
   * Association Rule Mining: Discovering interesting relationships or associations between variables in large datasets (e.g., market basket analysis).
      * Apriori Algorithm: Finding frequent itemsets in transactional data.   
      * Eclat Algorithm: Another algorithm for finding frequent itemsets.   
  
   * Anomaly Detection: Identifying data points that deviate significantly from the normal behavior or patterns in the data (e.g., fraud detection, fault detection).   
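
To make the clustering idea above more concrete, here is a minimal sketch of K-Means (Lloyd's iteration) using only the standard library. The six 2-D points are invented for illustration, and the function names, the random initialisation from data points, and the `max_iters` and `seed` parameters are assumptions of this sketch rather than a reference implementation.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, max_iters=50, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(max_iters):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the centroid to the mean of its assigned points.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Made-up unlabeled data: two visually separate groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

No labels are supplied; the grouping comes only from distances between points. Which group receives index 0 or 1 depends on the random initialisation.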

* Reinforcement Learning: In reinforcement learning, an agent learns to interact with an environment by taking actions and receiving rewards or penalties. The goal is for the agent to learn an optimal policy (a mapping from states to actions) that maximizes the cumulative reward over time.   

   * Control Tasks: Learning to control a system or agent to achieve a specific goal (e.g., robotics, autonomous driving).
   * Game Playing: Training agents to play games against opponents (e.g., AlphaGo, Atari games).   
   * Recommendation Systems: Optimizing recommendations based on user feedback and rewards.   
   * Resource Management: Learning to allocate resources efficiently.   
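
The agent, reward, and policy described above can be sketched with tabular Q-learning on a toy problem. Everything here is an assumption made for illustration: a five-state corridor where the agent starts at state 0 and receives a reward of 1 on reaching state 4, plus the chosen values for the learning rate `alpha`, discount `gamma`, and exploration rate `epsilon`.

```python
import random

N_STATES = 5        # states 0..4; state 4 is the goal
ACTIONS = [-1, +1]  # action index 0 = step left, 1 = step right


def step(state, action_idx):
    """Deterministic toy environment: reward 1 only when the goal is reached."""
    next_state = min(max(state + ACTIONS[action_idx], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q_row, rng):
    """Pick a highest-valued action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a, v in enumerate(q_row) if v == best])


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy: explore occasionally, otherwise act greedily.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = greedy_action(q[state], rng)
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q = q_learning()
# Greedy policy for the non-terminal states (1 means "step right").
policy = [row.index(max(row)) for row in q[:-1]]
print(policy)
```

After training, the greedy policy read off the table should prefer stepping right in every non-terminal state, since that is the only way to collect the reward.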

* Other Types of Machine Learning Tasks (Sometimes Considered Subcategories or Hybrid Approaches):

   * Semi-Supervised Learning: Learning from a combination of labeled and unlabeled data. This is useful when labeling data is expensive or time-consuming.   
   * Self-Supervised Learning: Learning from unlabeled data where the labels are generated from the data itself through a pretext task (e.g., predicting a missing part of an image). The learned representations can then be used for downstream supervised tasks.   
   * Machine Translation: Translating text from one language to another. This can be approached with sequence-to-sequence models in supervised learning.   
   * Transcription: Converting unstructured data like audio into a structured format like text (e.g., speech recognition).   
   * Synthesis and Sampling: Generating new data samples that are similar to the training data (e.g., generating images, text, or music).
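
For the self-supervised item above, a small sketch may help show how labels can come from the data itself. It uses a text analogue of the "predict the missing part" pretext task: each word is hidden in turn and becomes the target. The `make_masked_examples` helper, the `[MASK]` token, and the two-sentence corpus are hypothetical choices for this example.

```python
def make_masked_examples(sentences, mask_token="[MASK]"):
    """Turn unlabeled sentences into (masked_input, target_word) pairs.

    Each word is hidden once and used as the label, so no human
    annotation is needed.
    """
    examples = []
    for sentence in sentences:
        words = sentence.split()
        for i, target in enumerate(words):
            masked = words[:i] + [mask_token] + words[i + 1:]
            examples.append((" ".join(masked), target))
    return examples


corpus = ["the cat sat", "dogs bark"]  # made-up unlabeled text
pairs = make_masked_examples(corpus)
for masked, target in pairs:
    print(masked, "->", target)
```

A model trained on such pairs never sees externally provided labels; the representations it learns can then be reused for a downstream supervised task.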

The choice of machine learning task depends heavily on the nature of the data, the problem you are trying to solve, and the desired outcome. Understanding these different types of tasks is fundamental to applying machine learning effectively.

Datasets:

* Multi-class Classification Task &rarr; Iris Dataset
   * Number of Instances: 150
   * Number of Features: 4 (sepal length, sepal width, petal length, petal width)
   * Number of classes: 3 (setosa, versicolor, virginica)
* Binary Classification Task &rarr; Breast Cancer Wisconsin (Diagnostic) Dataset
   * Number of Instances: 569
   * Number of Features: 30 (real-valued features computed from digitized images of cell nuclei)
   * Number of classes: 2 (Malignant/Benign)
* Regression Task &rarr; Wine Quality Dataset
   * Number of Instances: 4,898 (white wine subset; the red wine subset has 1,599)
   * Number of Features: 11 (physicochemical tests)
   * Prediction: quality score

## Data Splitting

Data splitting divides a dataset into two or more subsets used to train, validate, and evaluate a machine learning model.

* **Training Set**: The largest portion of the data (typically 70-80%), used to train the model (learn **patterns and relationships** in the data).

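* **Validation/Development Set**: A smaller portion of the data (typically 10-15%), used to tune the model's hyperparameters and assess its performance during training. It helps detect **overfitting** by evaluating the model on data it was not trained on. Because hyperparameters are chosen against this set, its score becomes slightly optimistic, which is why a separate test set is kept for the final estimate.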

* **Test Set**: Another smaller portion of the data (typically 10-15%), used for the final evaluation of the trained model's performance. It provides an unbiased estimate of how well the model will perform on completely new, unseen data. The model **does not learn from this set** during training or hyperparameter tuning.
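
The three subsets above can be produced with a shuffle followed by slicing. The sketch below is one possible way to do it with the standard library; the function name, the 70/15/15 proportions, and the fixed `seed` are assumptions chosen to match the typical ranges mentioned above.

```python
import random


def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of the data and cut it into three disjoint subsets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


samples = list(range(100))  # stand-in for 100 data points
train, val, test = train_val_test_split(samples)
print(len(train), len(val), len(test))
```

Shuffling before slicing assumes the samples are independent of one another; for time-ordered data a chronological split is the safer choice.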

### Importance of Data Splitting

* **Prevent Overfitting**: By evaluating the model on data unseen during training (validation/test sets), we can check if the model has **learned generalizable patterns** (instead of **memorizing** the training data itself).
  
* **Model Selection and Hyperparameter Tuning**: The validation set allows us to compare different models and their configurations (hyperparameters) to choose the one that performs best on unseen data.

* **Assess Generalization**: The test set provides a final, unbiased estimate of the model's **ability to generalize to new data**.
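
The difference between memorizing and generalizing can be shown with a deliberately extreme toy case. In the sketch below, the labels are random coin flips (so there is nothing to learn), and the "model" is just a lookup table of the training examples. All names and the data are invented for illustration.

```python
import random

rng = random.Random(0)

# Made-up data: inputs are unique ids, labels are random 0/1 values.
data = [(i, rng.randint(0, 1)) for i in range(400)]
train, test = data[:200], data[200:]

# A "model" that only memorizes: a lookup table of the training pairs.
memory = dict(train)


def predict(x, default=0):
    """Return the memorized label, or a fixed guess for unseen inputs."""
    return memory.get(x, default)


def accuracy(dataset):
    return sum(predict(x) == y for x, y in dataset) / len(dataset)


print("train accuracy:", accuracy(train))
print("test accuracy:", accuracy(test))
```

The training accuracy is perfect because every training input is in the table, while the test accuracy is no better than guessing. Judged on training data alone, this model would look flawless, which is the failure that held-out data exposes.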

### Data Splitting Techniques

* **Train-Test Split**: The simplest method. With no separate validation set, any hyperparameter tuning would have to use the test set, which biases the final performance estimate.

* **Train-Validation-Test Split**: The most common method. Divides the data into three sets for training, hyperparameter tuning, and final evaluation. 

* **K-Fold Cross-Validation**: The dataset is divided into $k$ equal-sized *folds*. The model is trained and evaluated $k$ times, with each fold serving as the test set once and the remaining $k-1$ folds used for training. The performance is then averaged across all $k$ evaluations. This provides a more robust estimate of performance, especially with smaller datasets.

* **Time Series Split**: Used for time-dependent data, where the data is split chronologically. The training set consists of earlier time periods, and the test set consists of later time periods, preventing *looking into the future* during training. 
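
The last two techniques can be sketched as index generators. The helper names below are hypothetical, the folds are taken in order without shuffling (an assumption; independent samples are often shuffled first), and when the sample count is not divisible by $k$ the fold sizes differ by at most one.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def chronological_split(series, train_frac=0.8):
    """Earlier observations go to training, later ones to testing."""
    cut = int(len(series) * train_frac)
    return series[:cut], series[cut:]


for train_idx, test_idx in k_fold_indices(n_samples=6, k=3):
    print("train:", train_idx, "test:", test_idx)

# Made-up time-ordered measurements (oldest first).
readings = [10, 11, 13, 12, 14, 15, 17, 16, 18, 19]
past, future = chronological_split(readings)
print(past, future)
```

In a full cross-validation run, a model would be fitted on each `train_idx`, scored on the matching `test_idx`, and the $k$ scores averaged. The chronological split never shuffles, so no future observation leaks into training.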