In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Print a 5-number summary
One of the quickest ways to get a feel for new data is the 5-number summary: the minimum, 25th percentile, median, 75th percentile, and maximum of a distribution. The pandas `describe()` method prints these five values along with the count, mean, and standard deviation. By comparing the mean with the minimum and maximum values, you can get a rough idea of whether outliers are present in the distribution.

In [None]:
# Extract price
prices = airbnb_df["price"]

# Print 5-number summary
# To detect outliers, check whether the maximum differs too much from the mean
print(prices.describe())
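
To see how the comparison works, here is a minimal sketch on a small made-up array. The values in `toy_prices` are hypothetical and are not taken from the Airbnb data; 900 is planted as an extreme value. When the distance from the mean to the maximum is much larger than the distance from the mean to the minimum, a high outlier is likely pulling the mean up.

```python
import numpy as np

# Hypothetical toy prices (not from the Airbnb data); 900 is a planted extreme value
toy_prices = np.array([50, 60, 65, 70, 75, 80, 85, 90, 100, 900])

# The five numbers: minimum, 25th percentile, median, 75th percentile, maximum
minimum, q1, median, q3, maximum = np.percentile(toy_prices, [0, 25, 50, 75, 100])
mean = toy_prices.mean()

print(f"min={minimum}, q1={q1}, median={median}, q3={q3}, max={maximum}")
print(f"mean={mean}")

# Compare how far the mean sits from each end of the distribution
print(f"max - mean = {maximum - mean}")
print(f"mean - min = {mean - minimum}")
```

In this toy case the mean lands above the 75th percentile, which is another hint that a few large values dominate it.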

# Histograms for outlier detection
A histogram can be a compelling visual for finding outliers, which become apparent when an appropriate number of bins is chosen. Recall that the square root of the number of observations can be used as a rule of thumb for setting the number of bins. Outliers usually fall in the shortest bins, far from the bulk of the distribution.

In [None]:
# Find the square root of the length of prices
n_bins = np.sqrt(len(prices))

# Cast to an integer
n_bins = int(n_bins)

plt.figure(figsize=(8, 4))

# Create a histogram
plt.hist(prices, bins=n_bins, color='red')
plt.show()
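
The same idea can be checked numerically without drawing anything. The sketch below uses synthetic data (an assumption, not the Airbnb prices) with two planted extreme values, applies the square-root rule, and inspects the bin counts with `np.histogram`. Isolated, nearly empty bins separated from the bulk by empty bins are where the outliers sit.

```python
import numpy as np

# Synthetic data (hypothetical): 400 typical values plus two planted extremes
rng = np.random.default_rng(0)
toy_prices = rng.normal(loc=100, scale=15, size=400)
toy_prices = np.append(toy_prices, [400.0, 450.0])

# Square-root rule of thumb for the number of bins
n_bins = int(np.sqrt(len(toy_prices)))

# Count the observations that fall in each bin
counts, edges = np.histogram(toy_prices, bins=n_bins)

print("Number of bins:", n_bins)
print("Counts per bin:", counts)
print("Empty bins between the bulk and the extremes:", int((counts == 0).sum()))
```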

# Scatterplots for outlier detection
A scatterplot is another handy method to identify outliers visually. Although it is usually used to plot two variables against each other to inspect their relationship, using the trick from the video, you can plot a scatterplot with only one variable to make the outliers stand out.

In [None]:
# Create a list of consecutive integers
integers = range(len(prices))

plt.figure(figsize=(16, 8))

# Plot a scatterplot
plt.scatter(integers, prices, c='red', alpha=0.5)
plt.show()

# Boxplots for outlier detection
In this exercise, you will get a feel for what the US Airbnb Listings prices data looks like using boxplots. This will enable you to assess the range of the distribution where inliers lie. You will also get a sense of custom versus default parameters for setting whisker lengths to classify outliers.

In [None]:
# Create a boxplot of prices
plt.boxplot(prices)
plt.show()

In [None]:
# Create a boxplot with custom whisker lengths
plt.boxplot(prices, whis=5)
plt.show()
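
To make the effect of `whis` concrete, the sketch below counts how many points would fall beyond the whiskers for two settings. It assumes the usual boxplot convention that a numeric `whis` places the limits at Q1 - whis * IQR and Q3 + whis * IQR, with 1.5 as the default. The helper `count_fliers` and the array `toy_prices` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical toy prices with one moderate and one extreme high value
toy_prices = np.array([50, 60, 65, 70, 75, 80, 85, 90, 100, 200, 900])

def count_fliers(values, whis):
    """Count points beyond Q1 - whis * IQR or Q3 + whis * IQR (illustrative helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())

default_fliers = count_fliers(toy_prices, whis=1.5)
custom_fliers = count_fliers(toy_prices, whis=5)

print("Points beyond the whiskers with whis=1.5:", default_fliers)
print("Points beyond the whiskers with whis=5:", custom_fliers)
```

Longer whiskers classify fewer points as outliers, so the choice of `whis` is a judgment about how strict the definition should be.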

# Calculating outlier limits with IQR
Visualizing outliers is usually only the first step in detecting outliers. To go beyond visualizing outliers, you will need to write code that isolates the outliers from the distribution.

In [None]:
# Calculate the 25th and 75th percentiles
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)

# Find the IQR
IQR = q3 - q1
factor = 2.5

# Calculate the lower limit
lower_limit = q1 - (factor * IQR)

# Calculate the upper limit
upper_limit = q3 + (factor * IQR)

# Create a mask for values lower than lower_limit
is_lower = prices < lower_limit

# Create a mask for values higher than upper_limit
is_higher = prices > upper_limit

# Combine the masks to filter for outliers
outliers = prices[is_lower | is_higher]

# Count and print the number of outliers
print(len(outliers))

# Finding outliers with z-scores
Many natural measurements are approximately normally distributed, and the normal distribution is among the most commonly assumed distributions. This is why the z-score method can be one of the quickest methods for detecting outliers.

Recall the rule of thumb from the video: if a sample is more than three standard deviations away from the mean, you can consider it an extreme value.

However, recall also that the z-score method should be approached with caution. This method is appropriate only when we are confident our data comes from a normal distribution. Otherwise, the results might be misleading.

In [None]:
# Import the zscores function
from scipy.stats import zscore

# Find the zscores of prices
scores = zscore(prices)

# Check if the absolute values of scores are over 3
is_over_3 = np.abs(scores) > 3

# Use the mask to subset prices
outliers = prices[is_over_3]

print(len(outliers))
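
For reference, a z-score is just the distance from the mean measured in standard deviations. The sketch below computes it by hand with NumPy on synthetic, roughly normal data (an assumption, not the Airbnb prices) with one planted extreme value. It uses the population standard deviation (`ddof=0`), which is the default in `scipy.stats.zscore`.

```python
import numpy as np

# Synthetic, roughly normal data (hypothetical) with one planted extreme value
rng = np.random.default_rng(42)
toy_values = rng.normal(loc=100, scale=10, size=1000)
toy_values = np.append(toy_values, 250.0)

# z = (x - mean) / std, using the population standard deviation (ddof=0)
z = (toy_values - toy_values.mean()) / toy_values.std()

# Apply the three-standard-deviations rule of thumb
is_over_3 = np.abs(z) > 3
toy_outliers = toy_values[is_over_3]

print("Number of flagged values:", len(toy_outliers))
print("Largest flagged value:", toy_outliers.max())
```

Note that the extreme value itself inflates both the mean and the standard deviation, which is one reason the method can be misleading on skewed data.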

# Using modified z-scores with PyOD
It is time to unleash PyOD on outliers. We use the `MAD` estimator from `pyod` to compute modified z-scores. The estimator already uses the `median_abs_deviation` function under the hood, so it is unnecessary to repeat the previous steps.

In [None]:
# Import the MAD estimator from pyod
from pyod.models.mad import MAD

# Initialize with a threshold of 3.5
mad = MAD(threshold=3.5)

# Reshape prices to make it 2D
prices_reshaped = prices.values.reshape(-1, 1)

# Fit and predict outlier labels on prices_reshaped
labels = mad.fit_predict(prices_reshaped)

# Filter for outliers
outliers = prices[labels == 1]

print(len(outliers))
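
As a rough illustration of what a modified z-score measures, the sketch below computes it by hand with NumPy. It assumes the common formulation 0.6745 * (x - median) / MAD, where MAD is the median absolute deviation from the median; this is an assumption for illustration and has not been checked against PyOD's internals. The array `toy_prices` is hypothetical.

```python
import numpy as np

# Hypothetical toy prices; 900 is a planted extreme value
toy_prices = np.array([50, 60, 65, 70, 75, 80, 85, 90, 100, 900], dtype=float)

# The median and the median absolute deviation (MAD) are robust to extremes
median = np.median(toy_prices)
mad_value = np.median(np.abs(toy_prices - median))

# Modified z-score (assumed formulation with the 0.6745 scaling constant)
modified_z = 0.6745 * (toy_prices - median) / mad_value

# Label 1 for outliers and 0 for inliers, using the 3.5 threshold
toy_labels = (np.abs(modified_z) > 3.5).astype(int)

print("Labels:", toy_labels)
print("Flagged values:", toy_prices[toy_labels == 1])
```

Because the median and MAD barely move when an extreme value is added, this score is less affected by the outliers it is trying to find than the ordinary z-score.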