## Machine Learning 2 for Masters Students
# Topic: Multi-Target Prediction
### Group Members: Daniel Gombas, Amirhooshang Navaei
### Date: 03/19/2024
_____________________________________


## Section 1: Introduction

Multi-target prediction, also known as multi-output or multi-task learning, is an important area in machine learning that addresses problems where the goal is to predict multiple dependent variables (targets) simultaneously from a set of input variables (features). This approach is particularly relevant in complex real-world scenarios where multiple outcomes are interrelated and can influence each other.

### Problem Addressed:
**Complex Interdependencies**: Many real-world problems involve predicting multiple outcomes that may be related or dependent on each other. Traditional single-target prediction models might ignore these dependencies, potentially leading to suboptimal performance.  

**Efficiency**: By predicting multiple targets simultaneously, multi-target models can leverage shared information among targets, reducing the need to train separate models for each target. This can lead to more efficient use of computational resources and data.  

**Improved Generalization**: Multi-target models can potentially improve prediction accuracy by learning shared representations that capture underlying patterns relevant to multiple targets, thus enhancing the model's generalization capabilities.

### Intuition:
**Shared Representation**: The core intuition behind multi-target prediction is that the targets share some common underlying factors. By learning these shared representations, the model can make more informed predictions for each target.  

**Exploiting Correlations**: Multi-target models aim to exploit the correlations and interactions between targets. For example, in a health-related dataset, predicting multiple health outcomes simultaneously may yield better predictions than treating each outcome independently, as many health metrics are interrelated.  

**Joint Learning**: Multi-target prediction is essentially a form of joint learning, where the model learns to optimize the predictions for multiple targets in a coordinated manner. This joint learning approach can help in uncovering insights that may not be apparent when targets are considered in isolation.
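
As a minimal sketch of joint learning, assume synthetic data in which two targets depend on one hidden factor. All names and values below are illustrative and do not come from the project datasets. A single scikit-learn estimator can then be fit on a two-column target matrix:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical data: 5 features, and a hidden factor that drives both targets
X = rng.normal(size=(200, 5))
shared = X[:, 0] + 0.5 * X[:, 1]
Y = np.column_stack([
    2.0 * shared + rng.normal(scale=0.1, size=200),   # target 1
    -1.0 * shared + rng.normal(scale=0.1, size=200),  # target 2
])

# One model, fit on both targets at once
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, Y)

Y_pred = model.predict(X[:5])
print(Y_pred.shape)  # one row per sample, one column per target

# The shared factor shows up as a strong correlation between the targets
target_corr = np.corrcoef(Y[:, 0], Y[:, 1])[0, 1]
print(round(target_corr, 2))
```

Whether the joint fit actually beats two separate models depends on the data; the sketch only shows the mechanics.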


## Section 2: Preliminaries

Two datasets have been considered for this project:

* Enzyme substrates - Multi-Label Classification
* Energy efficiency in Buildings - Multi-target Regression
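
The two problem types differ mainly in the shape and type of the target matrix. The sketch below uses scikit-learn's synthetic data generators as stand-ins for the real datasets (an assumption made for illustration only) to show a binary label matrix next to a continuous target matrix:

```python
from sklearn.datasets import make_multilabel_classification, make_regression

# Multi-label classification: each column of Y_cls is a binary label
X_cls, Y_cls = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=6, random_state=0
)

# Multi-target regression: each column of Y_reg is a continuous target
X_reg, Y_reg = make_regression(
    n_samples=100, n_features=8, n_targets=2, random_state=0
)

print(Y_cls.shape)  # (100, 6)
print(Y_reg.shape)  # (100, 2)
```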

## Section 3: Basic Use-case
The principles of multi-target prediction apply to many domains, such as genomics, finance, and recommendation systems, where the multiple outcomes or variables of interest are not independent of each other.  

One of the most prototypical and illustrative use cases of multi-target prediction is in environmental modeling, specifically in predicting various aspects of weather or climate conditions from a set of input variables. In this context, multi-target prediction models can simultaneously forecast multiple weather parameters, such as temperature, humidity, precipitation, wind speed, and air pressure, from historical and current weather data.

### Why Environmental Modeling Suits Multi-Target Prediction:
**Interrelated Targets**: Weather variables are naturally interrelated; for example, air pressure affects temperature and wind patterns, while humidity can influence precipitation levels. Multi-target prediction models can leverage these relationships to improve the accuracy of forecasts.  

**Data Efficiency**: Environmental datasets can be massive and complex, making efficient data use crucial. Multi-target prediction allows for shared learning across related targets, making better use of available data.  

**Predictive Performance**: By considering multiple weather parameters simultaneously, multi-target models can capture the complex interactions between different elements of the weather system, potentially leading to more accurate and reliable predictions.  

**Operational Efficiency**: For weather forecasting agencies and environmental researchers, the ability to produce multiple forecasts simultaneously from a single model can greatly improve operational efficiency and reduce computational costs.


## Section 4: Show-case - Drug Discovery

A compelling use case where multi-target prediction can make a significant difference compared to training several single-target models is in drug discovery and personalized medicine, particularly in predicting the effects of drugs on multiple genetic markers or cellular responses simultaneously.  

### Context and Challenge:
In drug discovery and personalized medicine, it's crucial to understand how different compounds interact with various genetic markers or cellular processes. A single drug can have multiple effects, impacting various genes, proteins, or pathways. Traditional approaches might involve building separate models to predict the effect of compounds on each marker or response, which can be inefficient and may ignore the interdependencies between these effects.  

### Multi-Target Prediction in Drug Discovery:
**Simultaneous Prediction**: Multi-target prediction models can be trained to predict the effects of compounds on multiple genetic markers or cellular responses at once. This is particularly valuable in high-throughput screening processes where thousands of compounds are tested for their effects on a wide range of targets.  

**Capturing Interdependencies**: By predicting multiple targets simultaneously, these models can capture the complex interactions and dependencies between different cellular responses or genetic markers, leading to a more holistic understanding of compound effects.  

**Efficiency and Cost-Effectiveness**: Multi-target models can reduce the computational and time costs associated with training and maintaining numerous single-target models. In drug discovery, where time and resources are critical, this can be a substantial advantage.  

**Improved Predictive Performance**: Leveraging shared information across multiple targets can enhance predictive performance, especially in cases where data for some targets might be sparse but related targets have abundant data.  

### Impact and Difference:
In the context of drug discovery and personalized medicine, using multi-target prediction can significantly accelerate the discovery process and enhance the understanding of drug effects. It allows researchers to efficiently identify compounds with desired effects across multiple targets, facilitating the development of more effective and safer drugs. Moreover, in personalized medicine, understanding the multi-faceted interactions between drugs and an individual's unique genetic makeup can lead to more tailored and effective treatments, ultimately improving patient outcomes.


## Section 5: Results and Analysis
### Case I: Multi-label classification: [Enzyme substrates](https://www.kaggle.com/datasets/gopalns/ec-mixed-class?select=mixed_desc.csv)

In [2]:
import pandas as pd

data2 = pd.read_csv("mixed_ecfp.csv")

data2.head()

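| | CIDs | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 | M9 | ... | M504 | M505 | M506 | M507 | M508 | M509 | M510 | M511 | M512 | EC1_EC2_EC3_EC4_EC5_EC6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | C00009 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1_1_1_1_0_1 |
| 1 | C00013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1_1_1_1_0_1 |
| 2 | C00014 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1_1_1_1_0_1 |
| 3 | C00017 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0_1_1_0_0_0 |
| 4 | C00022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1_1_1_1_0_1 |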


In [3]:
# Split the "EC1_EC2_EC3_EC4_EC5_EC6" column into separate columns
targets_split = data2['EC1_EC2_EC3_EC4_EC5_EC6'].str.split('_', expand=True)

# Rename the columns for clarity
targets_split.columns = ['EC1', 'EC2', 'EC3', 'EC4', 'EC5', 'EC6']

# Concatenate the new target columns back to the original dataframe
data2 = pd.concat([data2.drop('EC1_EC2_EC3_EC4_EC5_EC6', axis=1), targets_split], axis=1)

# Drop the first column
data2 = data2.drop(data2.columns[0], axis=1)

# Convert all columns to int before extracting the targets;
# str.split above returns the labels as strings
data2 = data2.astype(int)

# Separate features
X = data2.loc[:, 'M1':'M512']  # Adjust the column names if needed

# Separate target variables
Y = data2[['EC1', 'EC2', 'EC3', 'EC4', 'EC5', 'EC6']]

Y1 = Y['EC1']
Y2 = Y['EC2']
Y3 = Y['EC3']
Y4 = Y['EC4']
Y5 = Y['EC5']
Y6 = Y['EC6']
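
Because `str.split` returns strings, the label columns need an explicit cast before they are used as numbers. The following toy example (the frame and its values are made up for illustration) isolates that step:

```python
import pandas as pd

# Toy frame with a combined label column, mimicking the format above
toy = pd.DataFrame({"EC1_EC2_EC3": ["1_0_1", "0_1_1"]})

# str.split with expand=True returns one string column per label
labels = toy["EC1_EC2_EC3"].str.split("_", expand=True)
labels.columns = ["EC1", "EC2", "EC3"]

# Cast to int so that downstream metrics and correlations see numbers
labels = labels.astype(int)
print(labels.values.tolist())  # [[1, 0, 1], [0, 1, 1]]
```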

### Method 1: Multiple Single-Target Prediction:

In [4]:
# Split the dataset into training and test sets.
# The same X, test_size, and random_state give the same row split for every label.
from sklearn.model_selection import train_test_split
X_train, X_test, Y1_train, Y1_test = train_test_split(X, Y1, test_size=0.2, random_state=42)
X_train, X_test, Y2_train, Y2_test = train_test_split(X, Y2, test_size=0.2, random_state=42)
X_train, X_test, Y3_train, Y3_test = train_test_split(X, Y3, test_size=0.2, random_state=42)
X_train, X_test, Y4_train, Y4_test = train_test_split(X, Y4, test_size=0.2, random_state=42)
X_train, X_test, Y5_train, Y5_test = train_test_split(X, Y5, test_size=0.2, random_state=42)
X_train, X_test, Y6_train, Y6_test = train_test_split(X, Y6, test_size=0.2, random_state=42)

# Train a Random Forest model
from sklearn.ensemble import RandomForestClassifier
model1 = RandomForestClassifier(n_estimators=100, random_state=42)
model1.fit(X_train, Y1_train)

model2 = RandomForestClassifier(n_estimators=100, random_state=42)
model2.fit(X_train, Y2_train)

model3 = RandomForestClassifier(n_estimators=100, random_state=42)
model3.fit(X_train, Y3_train)

model4 = RandomForestClassifier(n_estimators=100, random_state=42)
model4.fit(X_train, Y4_train)

model5 = RandomForestClassifier(n_estimators=100, random_state=42)
model5.fit(X_train, Y5_train)

model6 = RandomForestClassifier(n_estimators=100, random_state=42)
model6.fit(X_train, Y6_train)

# Make predictions
Y1_pred = model1.predict(X_test)
Y2_pred = model2.predict(X_test)
Y3_pred = model3.predict(X_test)
Y4_pred = model4.predict(X_test)
Y5_pred = model5.predict(X_test)
Y6_pred = model6.predict(X_test)

# Evaluate the model
from sklearn.metrics import accuracy_score
accuracy1 = accuracy_score(Y1_test, Y1_pred)
accuracy2 = accuracy_score(Y2_test, Y2_pred)
accuracy3 = accuracy_score(Y3_test, Y3_pred)
accuracy4 = accuracy_score(Y4_test, Y4_pred)
accuracy5 = accuracy_score(Y5_test, Y5_pred)
accuracy6 = accuracy_score(Y6_test, Y6_pred)

print(f'Accuracy1: {accuracy1}')
print(f'Accuracy2: {accuracy2}')
print(f'Accuracy3: {accuracy3}')
print(f'Accuracy4: {accuracy4}')
print(f'Accuracy5: {accuracy5}')
print(f'Accuracy6: {accuracy6}')

# print the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix1 = confusion_matrix(Y1_test, Y1_pred)
confusion_matrix2 = confusion_matrix(Y2_test, Y2_pred)
confusion_matrix3 = confusion_matrix(Y3_test, Y3_pred)
confusion_matrix4 = confusion_matrix(Y4_test, Y4_pred)
confusion_matrix5 = confusion_matrix(Y5_test, Y5_pred)
confusion_matrix6 = confusion_matrix(Y6_test, Y6_pred)

print(f'Confusion matrix1: \n {confusion_matrix1}')
print(f'Confusion matrix2: \n {confusion_matrix2}')
print(f'Confusion matrix3: \n {confusion_matrix3}')
print(f'Confusion matrix4: \n {confusion_matrix4}')
print(f'Confusion matrix5: \n {confusion_matrix5}')
print(f'Confusion matrix6: \n {confusion_matrix6}')

# print the mean accuracy for all the models
mean_accuracy = (accuracy1 + accuracy2 + accuracy3 + accuracy4 + accuracy5 + accuracy6) / 6
print(f'Mean accuracy: {mean_accuracy}')

# print the aggregate confusion matrix for all the models
confusion_matrix_aggregate = confusion_matrix1 + confusion_matrix2 + confusion_matrix3 + confusion_matrix4 + confusion_matrix5 + confusion_matrix6

print(f'Confusion matrix aggregate: \n {confusion_matrix_aggregate}')

Accuracy1: 0.6201923076923077
Accuracy2: 0.625
Accuracy3: 0.6826923076923077
Accuracy4: 0.7211538461538461
Accuracy5: 0.875
Accuracy6: 0.8509615384615384
Confusion matrix1: 
 [[53 44]
 [35 76]]
Confusion matrix2: 
 [[38 50]
 [28 92]]
Confusion matrix3: 
 [[119  23]
 [ 43  23]]
Confusion matrix4: 
 [[136  10]
 [ 48  14]]
Confusion matrix5: 
 [[177   6]
 [ 20   5]]
Confusion matrix6: 
 [[175   7]
 [ 24   2]]
Mean accuracy: 0.7291666666666666
Confusion matrix aggregate: 
 [[698 140]
 [198 212]]
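
The six-model baseline above can also be written more compactly. As a sketch, scikit-learn's `MultiOutputClassifier` fits one independent copy of a base estimator per label column, which is the same strategy as Method 1. The data below is synthetic and the variable names are hypothetical, so the numbers will not match the enzyme results:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the enzyme data (6 binary labels)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=6, random_state=42
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42
)

# Fits one independent RandomForestClassifier per label column
baseline = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=100, random_state=42)
)
baseline.fit(X_tr, Y_tr)
Y_hat = baseline.predict(X_te)

# Per-label accuracy, computed column by column
per_label_acc = [
    accuracy_score(Y_te[:, i], Y_hat[:, i]) for i in range(Y_te.shape[1])
]
print(per_label_acc)
```

A single split also avoids repeating `train_test_split` once per label.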


### Method 2: Multi-label Classification using Scikit-MultiLearn Library:

In [5]:
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, multilabel_confusion_matrix
import numpy as np
from skmultilearn.problem_transform import ClassifierChain
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Initialize ClassifierChain with RandomForestClassifier
classifier = ClassifierChain(RandomForestClassifier(random_state=42))

y_train = y_train.astype(int)
y_test = y_test.astype(int)


# Train the model
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)

# convert predictions to a dense format
predictions_dense = predictions.toarray() if hasattr(predictions, "toarray") else predictions

# Calculate accuracy for each label
accuracy_scores = [accuracy_score(y_test.iloc[:, i], predictions_dense[:, i]) for i in range(y_test.shape[1])]

# Calculate confusion matrix for each label
confusion_matrices = [confusion_matrix(y_test.iloc[:, i], predictions_dense[:, i]) for i in range(y_test.shape[1])]

# Calculate multilabel confusion matrix
multi_conf_matrix = multilabel_confusion_matrix(y_test, predictions_dense)

# Print accuracy scores for each label
for i, score in enumerate(accuracy_scores, 1):
    print(f"Accuracy for label EC{i}: {score}")

# Analyze confusion matrices
# For individual labels:
for i, cm in enumerate(confusion_matrices, 1):
    print(f"Confusion Matrix for label EC{i}:\n{cm}\n")

# For the multilabel confusion matrix:
print("Multilabel Confusion Matrix:\n", multi_conf_matrix)

Accuracy for label EC1: 0.6201923076923077
Accuracy for label EC2: 0.6201923076923077
Accuracy for label EC3: 0.7019230769230769
Accuracy for label EC4: 0.7211538461538461
Accuracy for label EC5: 0.8701923076923077
Accuracy for label EC6: 0.8461538461538461
Confusion Matrix for label EC1:
[[53 44]
 [35 76]]

Confusion Matrix for label EC2:
[[36 52]
 [27 93]]

Confusion Matrix for label EC3:
[[121  21]
 [ 41  25]]

Confusion Matrix for label EC4:
[[133  13]
 [ 45  17]]

Confusion Matrix for label EC5:
[[176   7]
 [ 20   5]]

Confusion Matrix for label EC6:
[[173   9]
 [ 23   3]]

Multilabel Confusion Matrix:
 [[[ 53  44]
  [ 35  76]]

 [[ 36  52]
  [ 27  93]]

 [[121  21]
  [ 41  25]]

 [[133  13]
  [ 45  17]]

 [[176   7]
  [ 20   5]]

 [[173   9]
  [ 23   3]]]


In [6]:
# print the mean accuracy for all the models
mean_accuracy_multilabel = np.mean(accuracy_scores)
print(f'Mean accuracy: {mean_accuracy_multilabel}')

# print the aggregate confusion matrix for all the models
confusion_matrix_aggregate_multilabel = np.sum(confusion_matrices, axis=0)

print(f'Confusion matrix aggregate: \n {confusion_matrix_aggregate_multilabel}')

Mean accuracy: 0.7299679487179486
Confusion matrix aggregate: 
 [[692 146]
 [191 219]]
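
To illustrate how a classifier chain passes label information along, the sketch below uses scikit-learn's own `ClassifierChain` rather than the scikit-multilearn class used above, with logistic regression on synthetic data. Both choices are assumptions made to keep the example small. Each link in the chain receives the original features plus the labels that come earlier in the chain:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# Label 0 is predicted from X alone, label 1 from X plus label 0,
# label 2 from X plus labels 0 and 1.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain.fit(X_demo, Y_demo)

# Number of input columns seen by each link of the chain
n_inputs = [est.coef_.shape[1] for est in chain.estimators_]
print(n_inputs)  # [10, 11, 12]

Y_chain = chain.predict(X_demo[:5])
print(Y_chain.shape)  # (5, 3)
```

The growing input width is what lets later labels depend on earlier ones. It also means the chain order can affect the results.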


### Method 3: Clustering trees - using SPYCT Library:

In [7]:
import spyct

y_test_spyct_0 = y_test.values

# Create a model
model = spyct.Model(num_trees=100, max_depth=4, random_state=123)

# Train the model
model.fit(X_train.values, y_train.values)

# Make predictions
y_pred_spyct = model.predict(X_test.values)

# rounding the values in y_pred_spyct using the threshold of 0.5
y_pred_spyct_rounded = y_pred_spyct.round()

# Calculate accuracy for each label
accuracy_scores_spyct = [accuracy_score(y_test.iloc[:, i], y_pred_spyct_rounded[:, i]) for i in range(y_test.shape[1])]
print(accuracy_scores_spyct)


# Calculate confusion matrix for each label
confusion_matrices_spyct = [confusion_matrix(y_test.iloc[:, i], y_pred_spyct_rounded[:, i]) for i in range(y_test.shape[1])]
print(confusion_matrices_spyct)

[0.6298076923076923, 0.6153846153846154, 0.6971153846153846, 0.7019230769230769, 0.8798076923076923, 0.875]
[array([[45, 52],
       [25, 86]], dtype=int64), array([[ 18,  70],
       [ 10, 110]], dtype=int64), array([[138,   4],
       [ 59,   7]], dtype=int64), array([[146,   0],
       [ 62,   0]], dtype=int64), array([[183,   0],
       [ 25,   0]], dtype=int64), array([[182,   0],
       [ 26,   0]], dtype=int64)]


In [8]:
# print the mean accuracy for all the models
mean_accuracy_spyct = np.mean(accuracy_scores_spyct)

print(f'Mean accuracy: {mean_accuracy_spyct}')

# print the aggregate confusion matrix for all the models
confusion_matrix_aggregate_spyct = np.sum(confusion_matrices_spyct, axis=0)

print(f'Confusion matrix aggregate: \n {confusion_matrix_aggregate_spyct}')

Mean accuracy: 0.733173076923077
Confusion matrix aggregate: 
 [[712 126]
 [207 203]]
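
The SPYCT confusion matrices above contain no predicted positives for EC4, EC5, and EC6, so the high per-label accuracy there mostly reflects class imbalance. The sketch below rebuilds the EC5 case from the reported counts and compares accuracy with recall and balanced accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Rebuild the EC5 test labels from the SPYCT confusion matrix above:
# 183 negatives, 25 positives, and every sample predicted as 0
y_true = np.array([0] * 183 + [1] * 25)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
bal = balanced_accuracy_score(y_true, y_pred)

print(round(acc, 4))  # 0.8798
print(rec)            # 0.0
print(bal)            # 0.5
```

An accuracy of about 0.88 here goes together with a recall of zero, so per-label accuracy alone may overstate how well the rare labels are predicted.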


In [9]:
# computing the correlation between the labels
Y = Y.astype(int)

correlation = Y.corr()
correlation

| | EC1 | EC2 | EC3 | EC4 | EC5 | EC6 |
|---|---|---|---|---|---|---|
| EC1 | 1.0 | -0.165158 | -0.289396 | 0.046609 | -0.003054 | 0.087381 |
| EC2 | -0.165158 | 1.0 | 0.03199 | -0.009563 | 0.043308 | 0.094042 |
| EC3 | -0.289396 | 0.03199 | 1.0 | -0.096364 | -0.050693 | -0.017205 |
| EC4 | 0.046609 | -0.009563 | -0.096364 | 1.0 | 0.114881 | 0.178339 |
| EC5 | -0.003054 | 0.043308 | -0.050693 | 0.114881 | 1.0 | 0.046642 |
| EC6 | 0.087381 | 0.094042 | -0.017205 | 0.178339 | 0.046642 | 1.0 |


### Case II: Multi-target regression: 
- All target variables are numeric and continuous.
- Several approaches can be taken to solve the problem.

### Dataset information: 
The task is to assess the heating load and cooling load requirements of buildings (that is, energy efficiency) as a function of building parameters.

-   Number of features: 8
-   Number of targets: 2
-   Number of instances: 768
-   Feature types: Integer, Real
-   Target types: Real

**Additional information:**  
This dataset is used for performing energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the afore-mentioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, aiming to predict two real valued responses. It can also be used as a multi-class classification problem if the response is rounded to the nearest integer.

**Source**: [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/242/energy+efficiency)

### Method 1: Multiple Single-Target Prediction:

In [10]:
import pandas as pd

# Load the dataset
data = pd.read_csv("Engeff.csv")

data.head()

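| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Y1 | Y2 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 2 | 0.0 | 0 | 15.55 | 21.33 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 3 | 0.0 | 0 | 15.55 | 21.33 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 4 | 0.0 | 0 | 15.55 | 21.33 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5 | 0.0 | 0 | 15.55 | 21.33 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 2 | 0.0 | 0 | 20.84 | 28.28 |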


In [11]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# Splitting the dataset into features and targets
X = data.drop(['Y1', 'Y2'], axis=1)
y1 = data['Y1']
y2 = data['Y2']

# Splitting the dataset into the training and testing set
X_train_1, X_test_1, y1_train, y1_test = train_test_split(X, y1, test_size=0.2, random_state=123)
X_train_2, X_test_2, y2_train, y2_test = train_test_split(X, y2, test_size=0.2, random_state=123)

rf_regressor_y1 = RandomForestRegressor(n_estimators=100, random_state=123)
rf_regressor_y2 = RandomForestRegressor(n_estimators=100, random_state=123)

rf_regressor_y1.fit(X_train_1, y1_train)
rf_regressor_y2.fit(X_train_2, y2_train)

y1_pred = rf_regressor_y1.predict(X_test_1)
y2_pred = rf_regressor_y2.predict(X_test_2)

rmse_y1 = sqrt(mean_squared_error(y1_test, y1_pred))
rmse_y2 = sqrt(mean_squared_error(y2_test, y2_pred))

(rmse_y1, rmse_y2, (rmse_y1 + rmse_y2) )


(0.4927568965432238, 1.3538390086578387, 1.8465959052010625)

### Method 2: Multi-Target Regression using the scikit-learn Library

In [12]:
# Splitting the dataset into features and targets
X = data.drop(['Y1', 'Y2'], axis=1)
y = data[['Y1', 'Y2']]

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Initializing and training the RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=123)
rf_regressor.fit(X_train, y_train)

# Making predictions on the test set
y_pred = rf_regressor.predict(X_test)

# Calculating the RMSE for each target
rmse_y1 = sqrt(mean_squared_error(y_test['Y1'], y_pred[:, 0]))
rmse_y2 = sqrt(mean_squared_error(y_test['Y2'], y_pred[:, 1]))

(rmse_y1, rmse_y2, rmse_y1 + rmse_y2)

(0.47121350132521606, 1.425401076434734, 1.8966145777599501)
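
The per-target RMSE values can also be obtained in one call. As a sketch on toy numbers (chosen for illustration, not taken from the dataset), `mean_squared_error` with `multioutput="raw_values"` returns one MSE per target column:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, two targets per row (illustrative only)
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 3.0], [5.0, 4.0]])

# multioutput="raw_values" returns one MSE per target column
mse_per_target = mean_squared_error(y_true, y_pred, multioutput="raw_values")
rmse_per_target = np.sqrt(mse_per_target)

print(mse_per_target)         # per-target MSE: 2.0 and 0.5
print(rmse_per_target.sum())  # sum of per-target RMSEs, the aggregate used above
```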

### Method 3: Clustering trees - using SPYCT Library:

In [13]:
import spyct

y_test_spyct = y_test.values

# Create a model
model = spyct.Model(splitter="grad", num_trees=200, random_state=123)

# Train the model
model.fit(X_train.values, y_train.values)

# Make predictions
y_pred_spyct = model.predict(X_test.values)

# Calculating the RMSE for each target
rmse_y1_spyct = sqrt(mean_squared_error(y_test_spyct[:, 0], y_pred_spyct[:, 0]))
rmse_y2_spyct = sqrt(mean_squared_error(y_test_spyct[:, 1], y_pred_spyct[:, 1]))

(rmse_y1_spyct, rmse_y2_spyct, (rmse_y1_spyct+ rmse_y2_spyct))

(0.44593994176199037, 1.3962173388169186, 1.8421572805789088)

In [14]:
# checking the correlation between the two targets
data[['Y1', 'Y2']].corr()

| | Y1 | Y2 |
|---|---|---|
| Y1 | 1.0 | 0.975862 |
| Y2 | 0.975862 | 1.0 |


## Section 6: Caution
### Situations Where Multi-Target Prediction Might Not Be Ideal:
**Unrelated Targets**: If the targets are largely independent, with little to no correlation, multi-target prediction might not offer significant benefits over single-target models and could even complicate the modeling process unnecessarily.  

**Highly Correlated Targets**: If target variables are highly correlated, multi-target prediction may add little accuracy over single-target models, because each target already carries nearly the same information. In the energy efficiency case above, Y1 and Y2 have a correlation of about 0.98, and the summed RMSE of the three methods differs by less than 0.06.  

**Vastly Different Scales or Types of Targets**: When targets have vastly different scales or are of different types (e.g., one is categorical, and another is continuous), it might be challenging to design a single model that effectively predicts all targets without bias or scale issues.  
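
One common way to reduce scale effects in multi-target regression is to standardize the targets before fitting and to map predictions back afterwards. The sketch below assumes synthetic targets on very different scales; all names and values are hypothetical. In scikit-learn's multi-output trees the split criterion is averaged over the targets, so a large-scale target can otherwise dominate the splits.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

# Hypothetical targets on very different scales
Y = np.column_stack([
    X[:, 0] + rng.normal(scale=0.1, size=200),             # roughly unit scale
    1000.0 * X[:, 1] + rng.normal(scale=100.0, size=200),  # thousands
])

# Standardize each target column before fitting
y_scaler = StandardScaler()
Y_scaled = y_scaler.fit_transform(Y)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, Y_scaled)

# Map predictions back to the original units
Y_pred = y_scaler.inverse_transform(model.predict(X))
print(Y_pred.shape)  # (200, 2)
```

In a real pipeline the scaler would be fit on the training targets only, so that no test information leaks into the scaling.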

### Common Pitfalls:
**Overfitting**: Just like any machine learning model, multi-target models can suffer from overfitting, especially if they're complex and the data is not sufficient to support the learning of multiple targets. Regularization, proper validation, and complexity control are essential to mitigate this risk.  

**Data Leakage**: In multi-target settings, there's a risk of data leakage between targets, especially if some targets could be predictors for others. This could lead to overly optimistic performance estimates. Proper data handling and validation strategies are crucial to prevent this.  

**Evaluation Complexity**: Evaluating the performance of multi-target models can be more complex than single-target models because you need to consider the performance across all targets. Selecting appropriate and comprehensive evaluation metrics is key.  

**Computational Cost**: While multi-target models can be efficient by sharing representations among targets, they can also be computationally intensive, especially if the number of targets is large and the model architecture is complex.
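
To make the evaluation point concrete, the sketch below computes three common multi-label metrics with scikit-learn on a small made-up example. Hamming loss counts individual label errors, subset accuracy requires every label of a sample to be correct, and macro F1 averages the per-label F1 scores:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Toy multi-label ground truth and predictions (3 samples, 3 labels)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# Fraction of individual label assignments that are wrong
h_loss = hamming_loss(Y_true, Y_pred)

# Subset accuracy: a sample counts only if all of its labels are correct
subset_acc = accuracy_score(Y_true, Y_pred)

# Macro F1: F1 per label, then an unweighted mean over labels
macro_f1 = f1_score(Y_true, Y_pred, average="macro", zero_division=0)

print(round(h_loss, 4))      # 0.2222
print(round(subset_acc, 4))  # 0.3333
print(round(macro_f1, 4))    # 0.5556
```

The mean per-label accuracy reported in Section 5 equals one minus the Hamming loss, so it is the most lenient of these views.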

## Section 7: Summary
### Relevance in Machine Learning:
**Efficiency and Performance**: Multi-target models can be more efficient than training separate models for each target, both in terms of computational resources and data usage. They can achieve better performance by exploiting the relationships among targets, although the gains in the two case studies above were small.  

**Complex Problem Solving**: Multi-target prediction is essential for complex problems in fields such as environmental science, health, finance, and more, where multiple factors are interrelated and need to be predicted simultaneously.  

**Advanced Insights**: By predicting multiple targets at once, these models can provide more holistic insights into the underlying problem, which can be critical for decision-making in various applications.  

### Take-home Messages:
**Interrelated Targets**: Consider multi-target prediction when your targets are interrelated. The approach is most beneficial when the prediction of one target can inform the prediction of another.  

**Model Complexity**: Be mindful of the model complexity. While multi-target models can leverage shared information, they can also become overly complex and prone to overfitting. Proper regularization and validation are key.  

**Data Considerations**: Ensure your dataset is sufficient and appropriate for multi-target prediction. The quality and quantity of data, along with how well it represents the interrelations among targets, are crucial for the success of these models.  

**Evaluation Strategies**: Adopt comprehensive evaluation strategies. Given the multiple outcomes, it's important to use evaluation metrics that can adequately assess the model's performance across all targets.  

**Practicality and Relevance**: Assess the practicality and relevance of multi-target prediction for your specific problem. While it offers significant advantages in many scenarios, it's not a one-size-fits-all solution and might not be suitable for problems with independent targets or where the complexity outweighs the benefits.  

In summary, multi-target prediction is a powerful approach in machine learning for addressing complex problems with interrelated targets, offering efficiency and potentially enhanced predictive performance. However, it requires careful consideration of the problem context, data quality, model complexity, and evaluation strategies to be effectively implemented.

## Section 8: References and Resources
- Papers:
    1. Jasmin Bogatinovski, Ljupčo Todorovski, Sašo Džeroski, Dragi Kocev (2022): Comprehensive comparative study of multi-label classification methods, Expert Systems with Applications, Volume 203
    2. Dragi Kocev, Celine Vens, Jan Struyf, Sašo Džeroski (2013): Tree ensembles for predicting structured outputs, Pattern Recognition, Volume 46, Issue 3

- Blogs: 
    * https://la.mathworks.com/help/deeplearning/ug/multilabel-image-classification-using-deep-learning.html

- Data Sources:
    1. Enzyme Substrates: Kaggle - [Structural information dataset](https://www.kaggle.com/datasets/gopalns/ec-mixed-class?select=mixed_desc.csv)
    2. Energy Efficiency: [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/242/energy+efficiency)
- Generative AI tools: 
    1. ChatGPT  
    2. Copilot
- Author contributions:
    1. Amirhooshang Navaei: Basics of Multi-Target Prediction, finding data sources, code
    2. Daniel Gombas: Papers, technical aspects of topic, code

## Section 9: Review Questions

1. What is multi-target prediction, and how does it differ from single-target prediction?  

2. Can you explain the difference between multi-class classification and multi-label classification?  

3. Describe a real-world application where multi-target prediction is particularly useful. Why is it preferred over single-target models in this scenario?  

4. What are some common domains or fields where multi-target prediction models are employed?  

5. How do multi-target prediction models leverage the relationships among multiple targets to improve prediction accuracy?  

6. Discuss at least one machine learning algorithm that can be adapted for multi-target prediction. How does it work?

7. What are some important considerations when preparing data for a multi-target prediction model?  

8. How might the preprocessing steps for a multi-target prediction model differ from those of a single-target prediction model?  

9. Identify and explain a common pitfall in multi-target prediction and suggest a way to mitigate it.  

10. Why is overfitting a concern in multi-target prediction, and what strategies can be used to prevent it?  

11. What are some key metrics for evaluating the performance of multi-target prediction models? How do these differ from single-target evaluation metrics? 