# Outliers and imputation

The first step in preparing data for analysis concerns **outliers and missing values**. Outliers are samples so different from the rest that they can skew your analysis if left in the dataset. Missing values are values that, for some reason, were not recorded for certain features of certain samples. Besides the information that is lost, scikit-learn estimators generally expect the input data to be complete.

To understand the resources provided by scikit-learn for this step, we'll use the New York City Airbnb dataset available on Kaggle. To download it, follow the [first stage of this tutorial](https://medium.com/@yvettewu.dw/tutorial-kaggle-api-google-colaboratory-1a054a382de0), which shows how to download access credentials for Kaggle (`kaggle.json`). Once you have downloaded the credentials, use the side menu to upload the file to Colab, and run the cells below:

In [0]:
import pandas as pd
import seaborn as sns
sns.set()

In [0]:
!mkdir -p /root/.kaggle
!cp /content/kaggle.json /root/.kaggle/kaggle.json
!chmod 600 /root/.kaggle/kaggle.json

In [0]:
!kaggle datasets download -d dgomonov/new-york-city-airbnb-open-data

> In this case, since the dataset is not the only file in the zip file, we first have to unzip everything before loading the dataset:

In [0]:
!unzip new-york-city-airbnb-open-data.zip

> Should any of the cells above fail, contact the maintainers of scikit-zero ;)

Let's take a peek into the dataset:

In [0]:
nyc_airbnb = pd.read_csv("AB_NYC_2019.csv")
nyc_airbnb.head()

In [0]:
nyc_airbnb.shape

Note that we now have a much larger number of samples than in the iris dataset we used previously. As for the number of features, this dataset is actually small by real-world standards, but that helps keep our notebook simple. Let's start by dropping some features that won't help in our analysis.

> `last_review` is a `datetime` field, which we could use for time series analysis. However, to keep this notebook simple, we'll discard it.

In [0]:
discard = ["id", "last_review"]

In [0]:
X = nyc_airbnb.drop(discard, axis=1)

Before moving on to the specific topics of this analysis, let's review the basics on missing values. We can check them as follows:

In [0]:
X.isna().sum()

In the case of features where very few samples have missing values, it's often safe to discard those samples. We can do that using the `dropna()` method, and specifying that we only want to drop samples for which the given subset of features present missing values (`subset=["name", "host_name"]`):

In [0]:
X = X.dropna(subset=["name", "host_name"])

In [0]:
X.isna().sum()

In [0]:
X.shape
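To see what `subset` changes, here is a minimal sketch on a made-up frame. The `toy` frame and its values are illustrative and are not taken from the Airbnb data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing name, one missing review rate.
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room"],
    "host_name": ["Ana", "Bob", "Cy"],
    "reviews_per_month": [0.5, 1.2, np.nan],
})

# Drop only the rows where "name" or "host_name" is missing.
kept = toy.dropna(subset=["name", "host_name"])

# Row 1 is gone; row 2 stays even though reviews_per_month is still missing.
print(kept.index.tolist())
print(kept.isna().sum())
```

Calling `dropna()` without `subset` would also discard row 2, which is why we restrict the check to the features that have only a few missing values.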

## Detecting outliers

The first step in data preparation is detecting and removing outliers. To do that, we have to select the features to consider for this analysis, which will generally be numerical. Looking at the feature distributions, we'll go with the following:

In [0]:
numerical = ["price", "minimum_nights", "number_of_reviews",
             "calculated_host_listings_count", "availability_365"]

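Since these features are all heavily right-skewed (their distributions look roughly exponential), we'll start by applying a logarithmic transformation. We use `np.log1p`, which computes log(1 + x), so that zero values remain valid: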

> Check [pandas-zero](https://github.com/leobezerra/pandas-zero) if you missed that episode ;)

In [0]:
import numpy as np

In [0]:
# np.log1p is vectorized, so it can be applied to the whole block of columns at once
X[numerical] = np.log1p(X[numerical])
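The effect of the transformation can also be checked numerically. The sketch below uses synthetic log-normal values as a stand-in for a skewed feature such as `price` (the data and the variable names are assumptions made for illustration) and compares the sample skewness before and after `np.log1p`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a feature such as price.
skewed = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))
transformed = np.log1p(skewed)

# Skewness close to 0 indicates a roughly symmetric distribution.
print("before:", skewed.skew())
print("after: ", transformed.skew())

# Unlike a plain logarithm, log1p is defined at zero.
print(np.log1p(0.0))
```

A skewness that drops toward zero suggests the transformed feature is closer to symmetric, which is what the plot in the next cell shows for the real data.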

In [0]:
sns.distplot(X["price"], bins=100)

Note that the numerical features we selected now follow distributions closer to a normal distribution. The next step is selecting an unsupervised learning algorithm to identify the samples that differ markedly from the bulk of the data. Scikit-learn offers a few options, and here we'll take **local outlier factor** (LOF) as an example. LOF flags samples whose local density is substantially lower than that of their neighbors:

In [0]:
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor()

Let's isolate the numerical features to make our task easier:

> Since the index of the data is preserved, we'll be able to apply the insights obtained from X_num to X later ;)

In [0]:
X_num = X[numerical]

In the context of outlier detection, we use the `fit_predict()` method directly on the input features, without having a target feature `y` to predict. The predicted values will either be 1 (an inlier) or -1 (an outlier), according to the internal model produced by LOF. Since the output is provided as a numpy array, we wrap it as a Pandas `Series` and specify that we want to preserve the index from `X_num`:

In [0]:
predicted = pd.Series(clf.fit_predict(X_num), name="predicted", index=X_num.index)
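To make the 1 / -1 convention concrete, the sketch below runs LOF on synthetic two-dimensional data with a single point planted far from the rest. The data, the column names `a` and `b`, and the variable names are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 200 points around the origin plus one point planted far away (last row).
points = np.vstack([rng.normal(0, 1, size=(200, 2)), [[10.0, 10.0]]])
toy = pd.DataFrame(points, columns=["a", "b"])

lof = LocalOutlierFactor(n_neighbors=20)  # 20 is also the library default
labels = pd.Series(lof.fit_predict(toy), index=toy.index, name="predicted")

# Counts of inliers (1) and outliers (-1), and the label of the planted point.
print(labels.value_counts())
print(labels.iloc[-1])

# The raw scores are also available: more negative means more anomalous.
print(lof.negative_outlier_factor_[-1])
```

Besides the labels, the fitted estimator exposes `negative_outlier_factor_`, which can help when you want to inspect how extreme a flagged sample is rather than relying only on the binary label.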

Now we can concatenate the `X_num` dataframe with the predicted labels, and query the samples that were identified as outliers:

In [0]:
X_predicted = pd.concat([X_num, predicted], axis=1)
X_predicted.head()

In [0]:
X_outliers = X_predicted.query("predicted == -1")
X_outliers.head()

In [0]:
X_outliers.shape

Since we have preserved the index, we can use it to drop the samples in `X` that LOF flagged as outliers:

In [0]:
X = X.drop(X_predicted.query("predicted == -1").index)
X.head()

In [0]:
X.shape
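The index-based drop works even when the index has gaps, as it does after `dropna()`. Here is a minimal sketch with made-up values and hand-written labels in the LOF convention (nothing here comes from the real dataset):

```python
import pandas as pd

# Hypothetical frame whose index is no longer 0..n-1, as happens after dropna().
toy = pd.DataFrame({"price": [4.1, 4.3, 9.9, 4.0]}, index=[0, 2, 5, 7])

# Made-up labels in the LOF convention: 1 for inliers, -1 for outliers.
labels = pd.Series([1, 1, -1, 1], index=toy.index, name="predicted")

# Select the index labels of the outliers and drop exactly those rows.
outlier_index = labels[labels == -1].index
cleaned = toy.drop(outlier_index)

print(cleaned.index.tolist())
```

Boolean filtering with `toy[labels == 1]` gives the same rows here; dropping by index is convenient when the labels live in a separate object.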

## Imputing missing values

Missing values can compromise an estimator's ability to learn from the data, and many scikit-learn estimators do not accept them at all. As we've done previously, sometimes it makes sense to drop samples or even whole features when the proportion of missing values allows that:

* if only a few samples present missing values for a given feature, the samples could likely be discarded.
* when almost all samples present missing values for a given feature, the feature could likely be discarded.
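One way to check these two rules of thumb is to compute the share of missing values per feature. The sketch below uses a made-up frame, and the 0.8 cut-off is a hypothetical value chosen only for illustration; the right threshold depends on the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one nearly empty feature and one nearly complete one.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],
    "rarely_missing": [np.nan] + [2.0] * 9,
    "complete": range(10),
})

# isna() yields booleans, so the mean is the fraction of missing values.
missing_share = toy.isna().mean()
print(missing_share)

# Hypothetical cut-off: features above it are candidates for removal.
drop_features = missing_share[missing_share > 0.8].index.tolist()
print(drop_features)
```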

When none of the conditions above apply, the typical approach is to **impute** data, i.e., to produce artificial values based on the available data. Imputation methods differ in which data they use to produce those values, as we discuss next.

### Based solely on the given feature

The simplest imputation approach is to fill values of a given feature based only on the values available for that feature from the remaining samples of the dataset. This is provided by scikit-learn as the `SimpleImputer` preprocessor, which fills missing values with the feature mean, by default:

In [0]:
from sklearn.impute import SimpleImputer
mean_imputer = SimpleImputer()
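As a quick illustration of the default behavior, the sketch below imputes a single made-up column. The values are arbitrary and chosen so the mean is easy to verify by hand:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature with a single missing value; the observed values average to 3.0.
column = np.array([[1.0], [2.0], [np.nan], [6.0]])

imputer = SimpleImputer()  # strategy="mean" by default
filled = imputer.fit_transform(column)

# statistics_ holds the value learned for each feature during fit.
print(imputer.statistics_)
print(filled.ravel())
```

The learned value is stored in `statistics_`, so the same fill value can later be applied to new data with `transform()`.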

Since we're only interested in imputing the `reviews_per_month` feature, we'll use the `make_column_transformer` function from the `compose` module. This function allows us to specify preprocessing approaches for specific features.

- Let's understand the code below:
```python
mean_transformer = make_column_transformer(
    (mean_imputer, ["reviews_per_month"]),
    remainder="drop",
)
```
- It creates the column transformer, specifying that `mean_imputer` should be applied to `["reviews_per_month"]`, and that the remaining features should be dropped.

In [0]:
from sklearn.compose import make_column_transformer
mean_transformer = make_column_transformer(
                                           (mean_imputer, ["reviews_per_month"]),
                                           remainder="drop"
                                           )

Once we have the transformer ready, we can use its `fit_transform()` method to impute the data, wrapping it in a Pandas `DataFrame` where we preserve the index from the original data for when we want to replace the original missing values. 

In [0]:
X_mean = pd.DataFrame(mean_transformer.fit_transform(X), columns=["reviews_per_month"], index=X.index)
X_mean.head()

In [0]:
X_mean.isna().sum()
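Preserving the index is what makes it possible to write the imputed values back later. A minimal sketch of that step, on a made-up frame with a non-contiguous index (the `toy` frame and its values are assumptions, not part of the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Hypothetical frame with a gap in reviews_per_month and a non-contiguous index.
toy = pd.DataFrame(
    {"reviews_per_month": [1.0, np.nan, 3.0], "price": [100, 80, 120]},
    index=[10, 20, 30],
)

transformer = make_column_transformer(
    (SimpleImputer(), ["reviews_per_month"]),
    remainder="drop",
)
imputed = pd.DataFrame(
    transformer.fit_transform(toy),
    columns=["reviews_per_month"],
    index=toy.index,
)

# Assignment aligns on the index, so each value lands on its original row.
toy["reviews_per_month"] = imputed["reviews_per_month"]
print(toy)
```

The other columns are untouched because only the imputed feature is assigned back.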

As we can see, every missing value has been replaced. Let's compare the distribution of this feature before and after imputation:

In [0]:
sns.distplot(X["reviews_per_month"], bins=100)

In [0]:
sns.distplot(X_mean["reviews_per_month"], bins=100)
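The comparison can also be made with summary statistics. The sketch below uses synthetic data with values removed at random, and `fillna` with the column mean stands in for `SimpleImputer`'s default strategy (both the data and this shortcut are assumptions made for brevity):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic skewed feature with roughly 20% of its values removed at random.
original = pd.Series(rng.exponential(scale=1.0, size=1000))
with_gaps = original.mask(rng.random(1000) < 0.2)

# Mean imputation, done here with pandas for brevity.
mean_filled = with_gaps.fillna(with_gaps.mean())

# The mean is preserved, but the spread shrinks: every gap gets the same value.
print(with_gaps.mean(), mean_filled.mean())
print(with_gaps.std(), mean_filled.std())
```

This is the numeric counterpart of the spike that mean imputation adds to the histogram: the center stays put while the variance is understated.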

The different strategies offered by `SimpleImputer` are meant to help the user find one that distorts the data distribution as little as possible. For instance, we could have tried replacing the missing values with the mode:

In [0]:
mode_imputer = SimpleImputer(strategy="most_frequent")
mode_transformer = make_column_transformer(
                                           (mode_imputer, ["reviews_per_month"]),
                                           remainder="drop"
                                           )
X_mode = pd.DataFrame(mode_transformer.fit_transform(X), columns=["reviews_per_month"], index=X.index)
sns.distplot(X_mode["reviews_per_month"], bins=100)

Notice that, in this case, the imputed feature distribution is very similar whether the mean or mode has been used, but that may not always be the case.
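On skewed data the strategies can disagree noticeably. The sketch below fills the same gap with three of the strategies `SimpleImputer` supports, using made-up values chosen so that the mean, median, and mode all differ:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Skewed made-up values with one gap in the last row.
column = np.array([[1.0], [1.0], [2.0], [4.0], [12.0], [np.nan]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(column)
    # Record the value that replaced the missing entry.
    fills[strategy] = float(filled[-1, 0])

print(fills)
```

Here the single large value pulls the mean well above the median and the mode, so the choice of strategy changes the imputed value considerably.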

### Based on multiple features

A more robust approach to data imputation is to determine artificial values based on multiple features (preferably, the whole dataset). In the example below, we'll use the numerical features to help us impute the missing values for `reviews_per_month`:

In [0]:
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer()

Let's isolate the numerical features once again, this time including the feature we want to impute:

In [0]:
numerical = ["price", "minimum_nights", "number_of_reviews", "reviews_per_month",
             "calculated_host_listings_count", "availability_365"]
X_num = X[numerical]

Imputation is performed the same way using `fit_transform()`, wrapping the result in a `DataFrame` and preserving column and index information:

> Note that the cell below takes a little longer to run, because a kNN model is fit internally.

In [0]:
X_imputed = pd.DataFrame(knn_imputer.fit_transform(X_num), columns=X_num.columns, index=X_num.index)
X_imputed.head()

In [0]:
X_imputed.isna().sum()
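To see how the neighbors drive the imputed value, here is a small sketch on made-up data with two clearly separated groups. The frame and `n_neighbors=2` are assumptions chosen to keep the arithmetic easy to follow; the cell above uses the default settings:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Two groups of listings: cheap with few reviews, expensive with many.
# The last row is cheap and is missing its review rate.
toy = pd.DataFrame({
    "price": [1.0, 1.1, 5.0, 5.1, 1.05],
    "reviews_per_month": [0.2, 0.4, 3.0, 3.2, np.nan],
})

imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# The gap is filled from the two nearest rows by price (rows 0 and 1).
print(filled.loc[4, "reviews_per_month"])

# For comparison, the plain column mean ignores the price information.
print(toy["reviews_per_month"].mean())
```

Because the method relies on distances between samples, features measured on very different scales can dominate the neighbor search, which is worth keeping in mind when choosing the features to include.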

Let's compare the data distribution before and after imputation:

In [0]:
sns.distplot(X["reviews_per_month"], bins=100)

In [0]:
sns.distplot(X_imputed["reviews_per_month"], bins=100)

As we can see, the imputed version is now much more similar to the original data distribution. 