In [None]:
# Import IForest from pyod
from pyod.models.iforest import IForest

# Detecting outliers with IForest
IForest is a robust estimator that needs only a few lines of code to detect outliers in a dataset. The syntax may look familiar because it closely mirrors scikit-learn's.

In [None]:
# Initialize an instance with default parameters
iforest = IForest()

# Generate outlier labels
labels = iforest.fit_predict(big_mart)

# Filter big_mart for outliers
outliers = big_mart[labels==1]

print(outliers.shape)

# Choosing contamination
Even though the implementation takes only a few lines, finding a suitable contamination value requires care.

Recall that the contamination parameter only affects the results of IForest. Once IForest generates raw anomaly scores, contamination is used to choose the top n% of anomaly scores as outliers. For example, 5% contamination will choose the observations with the highest 5% of anomaly scores as outliers.
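
To make the thresholding step concrete, here is a minimal plain-Python sketch of how a contamination value could be turned into labels once anomaly scores exist. The scores and the helper name `label_by_contamination` are made up for illustration. pyod derives its own cutoff from the training scores, so its handling of ties and rounding may differ from this simplified version.

```python
def label_by_contamination(scores, contamination):
    """Label the highest-scoring fraction of observations as outliers (1)."""
    # Number of observations to flag, rounded to a whole count (at least one)
    n_outliers = max(1, round(contamination * len(scores)))
    # Indices ordered from most to least anomalous
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_outliers])
    return [1 if i in flagged else 0 for i in range(len(scores))]


# Hypothetical anomaly scores for 20 observations (higher = more anomalous)
scores = [0.10, 0.12, 0.09, 0.11, 0.95, 0.13, 0.08, 0.10, 0.14, 0.11,
          0.12, 0.09, 0.10, 0.13, 0.11, 0.12, 0.10, 0.09, 0.11, 0.40]

labels_5 = label_by_contamination(scores, 0.05)   # flags 1 of 20 observations
labels_10 = label_by_contamination(scores, 0.10)  # flags 2 of 20 observations
print(sum(labels_5), sum(labels_10))
```

Raising contamination never changes the ordering of the scores in this sketch; it only moves the cutoff further down the ranked list.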

Although we will discuss some tuning methods in the following video, for now, you will practice setting an arbitrary value to the parameter.

In [None]:
# Instantiate an instance with 5% contamination
iforest = IForest(contamination=0.05)

# Fit IForest to Big Mart sales data
iforest.fit(big_mart)

# Choosing n_estimators
n_estimators is the parameter that influences model performance the most. Building IForest with enough trees ensures that the algorithm has enough generalization power to isolate the outliers from normal data points. The optimal number of trees depends on dataset size, and any number that is too high or too low will lead to inaccurate predictions.

In [None]:
# Create an IForest with 300 trees
iforest = IForest(n_estimators=300)

# Fit to the Big Mart sales data
iforest.fit(big_mart)


# Tuning contamination
Finally, it is time to tune the notorious contamination parameter. The evaluate_outlier_classifier and evaluate_regressor functions from the video are already loaded for you. You can inspect them below.

In [None]:
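from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_outlier_classifier(model, data):
    # Fit the model and get labels (0 = inlier, 1 = outlier)
    labels = model.fit_predict(data)

    # Return inliers
    return data[labels == 0]


def evaluate_regressor(inliers):
    # Separate the features from the "price" target
    X = inliers.drop("price", axis=1)
    y = inliers[['price']]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

    lr = LinearRegression()
    lr.fit(X_train, y_train)

    preds = lr.predict(X_test)
    # Take the square root manually; squared=False is deprecated in recent scikit-learn releases
    rmse = mean_squared_error(y_test, preds) ** 0.5

    return round(rmse, 3)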
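
As a reminder of what evaluate_regressor reports, the root mean squared error is the square root of the average squared difference between true and predicted values. The small hand computation below uses invented numbers, not values from the dataset.

```python
import math

# Hypothetical true and predicted prices (illustrative only)
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 12.0, 13.0, 22.0]

# Squared error for each observation
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]

# Mean of the squared errors, then the square root
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(round(rmse, 3))
```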

In [None]:
# Create a list of contaminations and an empty dictionary
contaminations = [0.07, 0.1, 0.15, 0.25]
scores = dict()

for c in contaminations:
    # Instantiate IForest with the current c
    iforest = IForest(contamination=c, random_state=10)
    
    # Get inliers with the current IForest
    inliers = evaluate_outlier_classifier(iforest, airbnb_df)
    
    # Calculate and store RMSE into scores
    scores[c] = evaluate_regressor(inliers)
    
print(scores)
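
Once a scores dictionary like this is filled in, the contamination with the lowest RMSE can be read off with `min`. The sketch below uses placeholder RMSE values rather than real results. Keep in mind that each RMSE is computed on a different subset of rows, so small differences between candidates should be read with caution.

```python
# Placeholder RMSE values keyed by contamination (not real results)
scores = {0.07: 52.1, 0.1: 50.8, 0.15: 49.3, 0.25: 51.6}

# Pick the key whose stored RMSE is smallest
best_contamination = min(scores, key=scores.get)
print(best_contamination, scores[best_contamination])
```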

# Tuning multiple hyperparameters
In this exercise, you will practice tuning multiple hyperparameters simultaneously. This is a valuable skill because an algorithm's hyperparameters often interact, so the best value for one depends on the others. Tuning them one at a time is therefore not usually recommended.

You will tune the max_features and max_samples parameters of IForest using a sample of the Big Mart sales data.
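
The grid of candidates can be enumerated with `itertools.product`, which yields every pairing of the two lists. The sketch below only builds the combinations and does not fit any model; the candidate values are illustrative. Proportions are written as floats because the scikit-learn isolation forest that pyod wraps treats an integer `1` as a count (one feature or one sample) and a float `1.0` as a fraction (all of them).

```python
from itertools import product

# Candidate proportions, written as floats so 1.0 means "all", not "one"
max_features = [0.6, 0.8, 1.0]
max_samples = [0.8, 0.9, 1.0]

# Every (max_features, max_samples) pairing, in nested-loop order
grid = list(product(max_features, max_samples))
print(len(grid))
print(grid[0], grid[-1])
```

With three candidates per parameter, the loop fits nine models, so the cost of a grid grows multiplicatively with each parameter added.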

In [None]:
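from itertools import product

# Use floats for proportions: an integer 1 would mean one feature or one sample
max_features = [0.6, 0.8, 1.0]
max_samples = [0.8, 0.9, 1.0]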
scores = dict()

for mf, ms in product(max_features, max_samples):
    # Instantiate an IForest
    iforest = IForest(
        max_features=mf, max_samples=ms, n_jobs=-1, contamination=0.25, random_state=1
    )
    
    # Get the inliers with the current IForest
    inliers = evaluate_outlier_classifier(iforest, airbnb_df)
    
    # Calculate and store RMSE into scores
    scores[(mf,ms)] = evaluate_regressor(inliers)
    
print(scores)

# Alternative way of classifying with IForest
Until now, you have been using the .fit_predict() method to fit IForest and generate predictions simultaneously. However, the pyod documentation suggests calling .fit() first and then reading the inlier/outlier labels from the labels_ attribute.

In [None]:
iforest = IForest(n_estimators=200)

# Fit (only fit) it to the Big Mart sales
iforest.fit(big_mart)

# Access the labels_ for the data
labels = iforest.labels_

# Filter outliers from big_mart
outliers = big_mart[labels==1]

print(len(outliers))

# Using outlier probabilities
An alternative to isolating outliers with contamination is using outlier probabilities. The main advantage of this method is that you choose the probability threshold yourself, so you decide how confident the model must be before a point is flagged as an outlier.
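
To show where such probabilities can come from, here is a plain-Python sketch that rescales raw anomaly scores into the [0, 1] range with min-max scaling and then applies a threshold. This assumes a linear conversion, which is pyod's documented default for predict_proba; the raw scores and the helper name `min_max_scale` are invented for illustration.

```python
def min_max_scale(scores):
    """Linearly rescale scores so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical raw anomaly scores (higher = more anomalous)
raw_scores = [-0.12, -0.05, 0.02, 0.30, -0.09, 0.21]

outlier_probs = min_max_scale(raw_scores)

# Keep the positions whose scaled score exceeds the 70% threshold
flagged = [i for i, p in enumerate(outlier_probs) if p > 0.70]
print(flagged)
```

Because the scaling is relative to the smallest and largest scores, these values are better read as a ranking on a fixed scale than as calibrated probabilities.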

In [None]:
iforest = IForest(random_state=10).fit(big_mart)

# Calculate probabilities
probs = iforest.predict_proba(big_mart)

# Extract the probabilities for outliers
outlier_probs = probs[:,1]

# Filter for when the probability is higher than 70%
outliers = big_mart[outlier_probs>0.70]

print(len(outliers))