# Data Exploration and Processing

**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)


In [None]:
# Start with required imports, and read the dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

source_dataset = "./data/AB_NYC_2019.csv"

# Load the dataset
data = pd.read_csv(source_dataset)

## Explore dataset and perform some analysis with pandas

Display the first few rows of the dataset to get a glimpse of the data:

In [None]:
data.head()

One of the first things we often want to check when exploring a new dataset is what kinds of data it contains (i.e., the schema), and whether any entries appear to be missing.

Let's get a summary of the dataset, including the number of non-null entries and the data type for each column. We expect the id column to be populated in every row, so any column with fewer non-null entries than the id column is missing some values. Missing values could cause problems if we were to train a machine learning model on this data.

In [None]:
data.info()
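
As a complement to `info()`, here is a minimal sketch of counting missing values per column directly. The small `toy` frame is a hypothetical stand-in with made-up values so that the cell runs on its own; on the real dataset the equivalent call would be `data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_name": ["Ann", None, "Cy", "Di"],
    "reviews_per_month": [0.5, np.nan, np.nan, 2.0],
})

# Count missing entries per column, then keep only the columns that have any
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)
```

Filtering to the non-zero counts keeps the output short when the dataset has many fully populated columns.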

We also usually want to see some descriptive statistics for the columns in order to give us a better understanding of the contents of the dataset.

In [None]:
data.describe()

If we look at statistics such as the minimum, maximum, and mean values, we can see that the features are on different scales. For example, the maximum value for price is 10000, but the maximum value for reviews_per_month is roughly 58. We'll come back to this later.

Let's also find the number of unique values for each column:

In [None]:
data.nunique()

After checking some of the basics, we can now start to do some more advanced data analysis to get additional insights from the data.

Let's display the top 10 neighborhoods with the most Airbnb listings:

In [None]:
data['neighbourhood'].value_counts().head(10)

Calculate the average price for each room type:

In [None]:
data.groupby('room_type')['price'].mean()

Find the top 10 hosts with the most listings:

In [None]:
data['host_id'].value_counts().head(10)

Calculate the percentage of listings for each room type:

In [None]:
(data['room_type'].value_counts() / data.shape[0]) * 100
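
The same percentages can also be obtained with the `normalize=True` argument of `value_counts()`, which returns fractions instead of counts. The sketch below compares both forms on a hypothetical `toy` frame (made-up values, used only so the cell is self-contained).

```python
import pandas as pd

# Hypothetical room types, for illustration only
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"]
})

# Manual form: counts divided by the number of rows
manual = toy["room_type"].value_counts() / toy.shape[0] * 100

# Built-in form: normalize=True returns fractions that sum to 1
normalized = toy["room_type"].value_counts(normalize=True) * 100
print(normalized)
```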

Find the average price per neighborhood:

In [None]:
data.groupby('neighbourhood')['price'].mean().sort_values(ascending=False)

Calculate the average availability (in days) for each room type:

In [None]:
data.groupby('room_type')['availability_365'].mean()

## Further exploration and visualization with matplotlib and seaborn

Let's start to visualize some of the characteristics of our dataset.
We'll start by creating a histogram to show the distribution of prices and help identify potential outliers.

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data['price'], bins=100)
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.xlim(1,1000)
plt.show()

As we can see, the majority of the accommodation options cost less than 200 USD. There are some datapoints (but not many) between 600 and 1000 USD. Those are either very expensive accommodation options, or they could be potential outliers/errors in the data.
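
Besides reading the histogram by eye, one common rule of thumb for flagging potential outliers is Tukey's 1.5 × IQR rule. This is an alternative technique, not the approach this notebook uses for cleaning (the clean-up section applies a fixed price range instead). The `prices` series below is hypothetical and exists only to make the sketch runnable.

```python
import pandas as pd

# Hypothetical nightly prices in USD (made up for illustration)
prices = pd.Series([50, 80, 100, 120, 150, 200, 900])

# Interquartile range: the spread of the middle 50% of the values
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as potential high outliers
upper_fence = q3 + 1.5 * iqr
flagged = prices[prices > upper_fence]
print(upper_fence, flagged.tolist())
```

A flagged value is only a candidate for review. Whether it is an error or a genuinely expensive listing still needs a judgment call.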

Now, let's see how many different kinds of room types exist in the dataset:

In [None]:
plt.figure(figsize=(10, 5))
sns.countplot(data=data, x='room_type')
plt.title('Room Type Distribution')
plt.xlabel('Room Type')
plt.ylabel('Count')
plt.show()

As we would expect, the majority of options are entire homes/apartments, followed closely by private rooms, and very few of the options are shared rooms.

Now, let's review the distribution of minimum nights in the postings:

In [None]:
plt.figure(figsize=(10, 5))
sns.histplot(data['minimum_nights'], bins=100)
plt.title('Minimum Nights Distribution')
plt.xlabel('Minimum Nights')
plt.ylabel('Count')
plt.xlim(1,400)
plt.show()

This is interesting: there are some datapoints above 50 nights. This would be quite unusual for Airbnb, so this could indicate the potential presence of outliers or errors in the dataset.

# Clean up the Data

Now, let's clean up some of the potential issues we identified during our exploration above, such as missing values and outliers. If we assume that this data will be used to train a machine learning model, then these kinds of cleaning steps are usually necessary.

Use-case: prepare the data for a regression model that will try to predict the nightly room price rate based on the listing's other features.

Firstly, we want to handle missing values. In this dataset, the 'host_id' and 'host_name' columns have missing values. We can fill the missing 'host_id' values with the column mode (most frequent value) and drop the rows with missing 'host_name' values.

In [None]:
# Fill missing host_id values with the mode
host_id_mode = data['host_id'].mode()[0]
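# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame
data['host_id'] = data['host_id'].fillna(host_id_mode)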

# Drop rows with missing host_name values
data.dropna(subset=['host_name'], inplace=True)

Some of the features in the dataset are also unlikely to be valuable for training a machine learning model. For example, the following features are not likely to affect the price: 'id', 'name', 'host_name', 'last_review', and 'reviews_per_month'.

Let's remove those columns from our dataset:

In [None]:
columns_to_drop = ['id', 'name', 'host_name', 'last_review', 'reviews_per_month']
data_cleaned = data.drop(columns=columns_to_drop)

Now let's remove the price outliers, because they could inaccurately skew an ML model that's trained on this data. We can remove listings with extremely high or low prices by setting a reasonable price range. In this example, we'll consider listings with prices between 10 USD and 800 USD:

In [None]:
price_range = (data_cleaned['price'] >= 10) & (data_cleaned['price'] <= 800)
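# Take an explicit copy so that later column assignments modify an independent
# DataFrame and do not trigger a SettingWithCopyWarning
data_cleaned = data_cleaned.loc[price_range].copy()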

Let's also handle the outliers in 'minimum_nights'. Rather than dropping those rows, we can cap the 'minimum_nights' column at an appropriate value, such as 30 days, so that extreme values are limited to that maximum:

In [None]:
data_cleaned['minimum_nights'] = np.where(data_cleaned['minimum_nights'] > 30, 30, data_cleaned['minimum_nights'])
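
The `np.where` call above is one way to cap a column. The pandas `Series.clip()` method with `upper=30` should give the same result more compactly. The sketch below checks that on a hypothetical `nights` series (made-up values).

```python
import numpy as np
import pandas as pd

# Hypothetical minimum_nights values, for illustration only
nights = pd.Series([1, 3, 30, 45, 365])

# Form used in the notebook: replace values above 30 with 30
capped_where = np.where(nights > 30, 30, nights)

# Equivalent form: clip() limits values to the given upper bound
capped_clip = nights.clip(upper=30)
print(capped_clip.tolist())
```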

# Additional preprocessing for regression use-case

In addition to the clean up that we've performed, let's also perform some data transformations that will prepare the data for an ML use-case.

First, let's define our target variable (denoted as y) and separate that from the rest of the features (denoted as X):

In [None]:
# Feature selection
features = ['neighbourhood_group', 'neighbourhood', 'latitude', 'longitude',
            'room_type', 'minimum_nights', 'number_of_reviews',
            'calculated_host_listings_count', 'availability_365']
target = 'price'

X = data_cleaned[features]
y = data_cleaned[target]


Predicting the price is an example of a regression use-case. Most regression models require numeric inputs, so we want to convert the categorical (i.e., non-numeric) features in our dataset into numeric values. We can use a technique called one-hot encoding to do that, which can be performed with the pandas [get_dummies()](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html) function:

In [None]:
X_encoded = pd.get_dummies(X, columns=['neighbourhood_group', 'neighbourhood', 'room_type'], drop_first=True)
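
To make the effect of one-hot encoding and `drop_first=True` concrete, here is a small sketch on a hypothetical three-row `toy` frame (made-up values). Each category becomes its own indicator column, and `drop_first=True` drops the first category in sorted order, which is then represented by all of the remaining indicators being zero.

```python
import pandas as pd

# Hypothetical listings, for illustration only
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "minimum_nights": [2, 1, 5],
})

# One indicator column per category, minus the first ("Entire home/apt")
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Dropping one category avoids redundant columns, because the dropped category can always be inferred from the others.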

As we saw in the statistical distributions of our column values earlier in this notebook, the numerical features in our dataset are on very different scales. This could inaccurately skew an ML model that's trained on this data, because it might think that the larger scale features are more important or impactful. To remove this potential problem, we will change all of the numerical features to be on a standard scale. We can use the StandardScaler class from scikit-learn for this purpose. For more information, see [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

In [None]:
from sklearn.preprocessing import StandardScaler

numerical_features = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
                      'calculated_host_listings_count', 'availability_365']

scaler = StandardScaler()
X_encoded[numerical_features] = scaler.fit_transform(X_encoded[numerical_features])
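
For intuition, standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below applies that formula by hand with NumPy to a hypothetical `values` array (made-up numbers). It uses the population standard deviation (`ddof=0`), which is the estimator the scikit-learn documentation describes for `StandardScaler`.

```python
import numpy as np

# Hypothetical feature column, for illustration only
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean = values.mean()
std = values.std()  # NumPy default is the population std (ddof=0)

# Standardize: subtract the mean, then divide by the standard deviation
z = (values - mean) / std
print(z, z.mean(), z.std())
```

After the transformation the column has a mean of approximately 0 and a standard deviation of approximately 1, regardless of its original scale.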


In [None]:
X_encoded.head()

Now our dataset should be ready to use for training a regression model. We'll leave that activity for another time. In large companies, the tasks of preparing data and training models are often separated, and performed by different people or teams.