<a href="https://colab.research.google.com/github/guilhermelaviola/ApplicationsOfDataScienceInDisruptiveTechnologies/blob/main/Class04.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MIL (Multiple Instance Learning)**
Multiple instance learning (MIL) is a weakly supervised approach to machine learning that is well suited to noisy, incomplete, or ambiguous real-world data. MIL operates on "bags," each containing multiple instances, and labels are assigned to bags rather than to individual instances. Under the standard assumption, a bag is labeled positive if at least one of its instances is positive, and negative if all of its instances are negative. This makes MIL suitable for scenarios where instance-level labeling is difficult, expensive, or even impossible.

MIL has proven effective in applications such as object detection in images, medical diagnosis, document analysis, and spam filtering. It focuses on patterns at the bag level rather than being overly influenced by potentially noisy or mislabeled individual instances. Implementations often reuse traditional machine learning algorithms, with the objective changed to minimizing classification error on bags rather than on instances.

Platforms like Google Colab provide an accessible environment for experimenting with MIL: users can build and train models on real-world datasets and share code and results with other researchers and practitioners. As data volumes continue to grow, MIL is likely to play an increasingly important role in extracting meaningful insights from complex and challenging data.
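
The bag-labeling rule described above can be illustrated with a few lines of NumPy. This is a minimal sketch that assumes binary instance labels; the helper name `bag_label` and the toy arrays are hypothetical and do not come from the notebook.

```python
import numpy as np

def bag_label(instance_labels):
    """Return 1 if any instance in the bag is positive, otherwise 0."""
    # With 0/1 labels, the maximum is 1 exactly when at least one instance is positive.
    return int(np.max(instance_labels))

# Hypothetical toy bags of binary instance labels:
positive_bag = np.array([0, 0, 1, 0])  # One positive instance
negative_bag = np.array([0, 0, 0, 0])  # No positive instances

print(bag_label(positive_bag))  # 1
print(bag_label(negative_bag))  # 0
```

In practice the instance labels are usually unknown, which is why MIL models learn from the bag labels alone.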

In [1]:
# Importing all the necessary libraries and resources:
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

## **MIL on Google Colab**

In [2]:
# Function to create bags with more informative features and instances:
def create_bags(num_bags=100, num_instances_per_bag=20):
    X, y = [], []
    num_positive_bags = num_bags // 2
    num_negative_bags = num_bags - num_positive_bags

    for _ in range(num_positive_bags):
        # Using more features, most of them informative:
        bag = make_classification(n_samples=num_instances_per_bag, n_features=20, n_informative=15, n_redundant=5)
        X.append(bag[0])  # Keeping only the feature matrix; instance labels are discarded
        y.append(1)

    for _ in range(num_negative_bags):
        # Negative bags are drawn with the same settings; only the assigned bag label differs:
        bag = make_classification(n_samples=num_instances_per_bag, n_features=20, n_informative=15, n_redundant=5)
        X.append(bag[0])
        y.append(0)

    return X, y

# Generating the bags:
X, y = create_bags()

# Flattening the bags so that every instance inherits its bag's label
# (assumes all bags contain the same number of instances):
X_flat = np.array([instance for bag in X for instance in bag])
y_flat = np.repeat(y, len(X[0]))

# Splitting the data into training and test:
X_train, X_test, y_train, y_test = train_test_split(X_flat, y_flat, test_size=0.3, stratify=y_flat, random_state=42)

# Defining the hyperparameter grid, with stronger regularization:
param_grid = {
    'n_estimators': [100],         # Fixed number of trees
    'max_depth': [10],             # Limiting tree depth
    'min_samples_split': [5, 10],  # Minimum samples required to split a node
    'min_samples_leaf': [4, 6]     # Minimum samples required at a leaf
}

# Configuring the GridSearchCV:
grid_search = GridSearchCV(estimator=RandomForestClassifier(random_state=42), param_grid=param_grid, cv=3, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Best model found by GridSearchCV:
best_model = grid_search.best_estimator_

# Making predictions with the best model:
y_pred_best_rf = best_model.predict(X_test)

# Evaluating the accuracy of the optimized Random Forest model:
accuracy_best_rf = accuracy_score(y_test, y_pred_best_rf)
print(f'Accuracy of the optimized Random Forest model: {accuracy_best_rf:.2f}')

# Evaluating the optimized model with ROC AUC:
roc_score_best_rf = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
print(f'ROC AUC of the optimized Random Forest model: {roc_score_best_rf:.2f}')

# Training the Gradient Boosting model:
model_gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=10, random_state=42)
model_gb.fit(X_train, y_train)

# Making predictions with Gradient Boosting:
y_pred_gb = model_gb.predict(X_test)

# Evaluating the accuracy and ROC AUC of Gradient Boosting:
accuracy_gb = accuracy_score(y_test, y_pred_gb)
roc_score_gb = roc_auc_score(y_test, model_gb.predict_proba(X_test)[:, 1])

print(f'Accuracy of Gradient Boosting model: {accuracy_gb:.2f}')
print(f'ROC AUC of Gradient Boosting model: {roc_score_gb:.2f}')

# Training the XGBoost model:
model_xgb = xgb.XGBClassifier(n_estimators=100, max_depth=10, learning_rate=0.1, random_state=42)
model_xgb.fit(X_train, y_train)

# Making predictions with XGBoost:
y_pred_xgb = model_xgb.predict(X_test)

# Evaluating the accuracy and ROC AUC of XGBoost:
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
roc_score_xgb = roc_auc_score(y_test, model_xgb.predict_proba(X_test)[:, 1])

print(f'Accuracy of XGBoost model: {accuracy_xgb:.2f}')
print(f'ROC AUC of XGBoost model: {roc_score_xgb:.2f}')

Accuracy of the optimized Random Forest model: 0.60
ROC AUC of the optimized Random Forest model: 0.65
Accuracy of Gradient Boosting model: 0.60
ROC AUC of Gradient Boosting model: 0.63
Accuracy of XGBoost model: 0.63
ROC AUC of XGBoost model: 0.66
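
The cell above flattens the bags and then splits at the instance level, so instances from the same bag can end up in both the training and the test set. One possible alternative, sketched here under the assumption that bags should be kept intact, is scikit-learn's `GroupShuffleSplit` with a bag index as the group key. The arrays `X_demo`, `y_demo`, and `bag_ids` are hypothetical stand-ins rather than the notebook's data, and this splitter does not stratify by label.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 bags of 4 instances with 3 features each.
rng = np.random.default_rng(0)
num_bags, instances_per_bag, num_features = 10, 4, 3

X_demo = rng.normal(size=(num_bags * instances_per_bag, num_features))
bag_ids = np.repeat(np.arange(num_bags), instances_per_bag)      # Bag index of each instance
y_demo = np.repeat(np.arange(num_bags) % 2, instances_per_bag)   # Each instance inherits its bag's label

# Splitting by bag, so that no bag contributes to both sets:
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=bag_ids))

train_bags = set(bag_ids[train_idx])
test_bags = set(bag_ids[test_idx])
print(train_bags.isdisjoint(test_bags))  # True
```

The resulting index arrays can be used to slice the flattened features and labels in the same way as the output of `train_test_split`.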


## **MIL Performance**

In [3]:
# Function to create bags with multiple instances:
def create_bags(num_bags=200, num_instances_per_bag=10):
    X, y = [], []
    # Defining half of the bags as positive and the other half as negative:
    num_positive_bags = num_bags // 2
    num_negative_bags = num_bags - num_positive_bags

    for _ in range(num_positive_bags):
        # Generating data for positive bags:
        bag = make_classification(n_samples=num_instances_per_bag, n_features=10, n_informative=5, n_redundant=3)
        X.append(bag[0])
        y.append(1)

    for _ in range(num_negative_bags):
        # Generating data for negative bags:
        bag = make_classification(n_samples=num_instances_per_bag, n_features=10, n_informative=5, n_redundant=3)
        X.append(bag[0])
        y.append(0)

    return X, y

# Creating the bags and instances:
X, y = create_bags()

# Transforming the data from the bags into a format suitable for the model:
X_flat = np.array([instance for bag in X for instance in bag])
y_flat = np.repeat(y, len(X[0]))

# Splitting the data into training and test:
X_train, X_test, y_train, y_test = train_test_split(X_flat, y_flat, test_size=0.3, stratify=y_flat, random_state=42)

# Training the Random Forest model:
model_rf = RandomForestClassifier(n_estimators=100, random_state=42)
model_rf.fit(X_train, y_train)

# Making predictions on the test set:
y_pred = model_rf.predict(X_test)

# Evaluating the accuracy of the model:
accuracy_rf = accuracy_score(y_test, y_pred)
print(f'Accuracy of the Random Forest model: {accuracy_rf:.2f}')

# Evaluating the model with ROC AUC:
roc_score_rf = roc_auc_score(y_test, model_rf.predict_proba(X_test)[:, 1])
print(f'ROC AUC of the Random Forest model: {roc_score_rf:.2f}')

Accuracy of the Random Forest model: 0.55
ROC AUC of the Random Forest model: 0.58
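
The scores above are computed per instance, whereas MIL is ultimately concerned with bag-level predictions. A minimal sketch of one common aggregation, taking the maximum instance probability in each bag as the bag score, might look like the following. The function name `bag_scores_from_instances`, the 0.5 threshold, and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def bag_scores_from_instances(instance_probs, bag_ids):
    """Max-pool instance probabilities into one score per bag."""
    unique_bags = np.unique(bag_ids)
    # A bag's score is the highest positive-class probability among its instances.
    scores = np.array([instance_probs[bag_ids == b].max() for b in unique_bags])
    return unique_bags, scores

# Hypothetical instance-level probabilities for two bags of three instances:
instance_probs = np.array([0.1, 0.2, 0.9, 0.3, 0.4, 0.2])
bag_ids = np.array([0, 0, 0, 1, 1, 1])

bags, scores = bag_scores_from_instances(instance_probs, bag_ids)
bag_predictions = (scores >= 0.5).astype(int)  # Assumed decision threshold

print(bags)             # [0 1]
print(scores)           # [0.9 0.4]
print(bag_predictions)  # [1 0]
```

Max-pooling mirrors the rule that a single positive instance makes a bag positive; mean-pooling is another option when positive bags are expected to contain many positive instances.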
