## Simple tutorial on exploratory model building on a subset of the Airbnb dataset.

To run this notebook, you need to download the data from [here](https://insideairbnb.com/get-the-data) and put it in the same directory as this notebook. The functionality of this notebook was tested with the data from Vienna.


In this introductory notebook, we will take a closer look at the Airbnb data. This is by no means a "perfect" analysis, but serves only as an introduction to how one might approach a problem such as the one at hand. So the analyses performed are only meant as a kind of inspiration for your own models.

The variable selection is arbitrary and not every potentially valuable variable is used. We will for instance only use some of the structural variables and the images, but no text data.

In this simple tutorial you will learn:

* How to read in and read out the images for modeling.
* How to deal with missing data
    * Simple imputation transformations to fill in missing values, such as using the median
* How to preprocess structural variables
* Creating base models for the structural variables for data exploration.
    * Simple linear regression
    * Decision Tree Regression
    * Random Forest Regression
* How to create a Multi Layer Perceptron (MLP) with Pytorch.
* How to build a basic Convolutional Neural Network (CNN) with Pytorch.
* Predicting the RMSE using your CNN architecture.


## Index
#### 1. Load the data and scrape the images

#### 2. Data Exploration

#### 3. Model building using only the structural variables

#### 4. Model building using the images


# 1. Load the data

In [1]:
import pandas as pd
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import requests
from io import BytesIO
import matplotlib.image as mpimg
from tqdm import tqdm
import seaborn as sns

I chose only to use the first 500 observations of the available data to have a faster computation time.
In this tutorial we will not use the 4 datasets loaded below but will solely focus on the listings.csv.gz file.

In [2]:
listings = pd.read_csv("listings.csv")[0:1500]
reviews = pd.read_csv("reviews.csv")[0:1500]
reviews_meta = pd.read_csv("reviews.csv.gz")[0:1500]
calendar = pd.read_csv("calendar.csv.gz")[0:1500]

Load the main data file

In [3]:
train_df = pd.read_csv("listings.csv.gz")[0:1500]

In [4]:
# drop all rows where variables price, host_is_superhost or pictures are missing
train_df = train_df.dropna(subset=["picture_url", "host_picture_url", "price", "host_is_superhost"])
train_df = train_df.reset_index(drop=True)

Unfortunately, the variable we are interested in, the price, is given in a string format in the form of  "10.00$".

For easier computation we transform these strings into floats

In [5]:
# transform price
train_df["price"] = (
    train_df["price"].str.replace("$", "").str.replace(",", "").astype(float)
)

Additionaly let's compute the log price, as predicting the log price gives us some advantages we will see later in the notebook

In [6]:
train_df["log_price"] = np.log(train_df["price"])

As you can see below, we have an enormous amount of available data (75 columns + additional files). However, some of these variables contain essentially the same or extremely correlated information. For example, I would guess that the sentiment of a written review is highly correlated with the given *star* rating.

Let's have a quick look at the data. We have 75 columns with a lot of variables that probably do not bare any explanation in them at all.

In [None]:
train_df.head()

### "Scrape" the images

First, we try to scrape the available images: 

This loop, as terrible as it is written, loops through the given flat-picture and host-picture URL's and appends the scraped pictures to two lists. We resize the images to a size of 224 by 224 pixels and normalize them by dividing through 255 (the maximum pixel value). All images that cannot be scraped or are not with rgb are dropped from the dataframe. This is obviously optional and can be handled differently by your approach.

The used picture size of 224x224 is also done randomly. Have in mind, however, that larger pixel values (while maybe giving you better prediction results) will give you longer computation times. However, using e.g. 8x8 images as is sometimes done with the MNIST data sets will give you unrecognizable flat images.

Depending on your number of observations this can take a while. Therefore it is sensible for you (if you haven't already) to start your analysis by scrapping all images and (obviously) saving them.

In [None]:
from PIL import Image
from io import BytesIO
import requests
import numpy as np
from tqdm import tqdm

pictures_flat = []
pictures_host = []
drop_indices = []

for i in tqdm(range(len(train_df))):
    try:
        # Get image URLs
        url_flat = train_df["picture_url"].iloc[i]
        url_host = train_df["host_picture_url"].iloc[i]

        # Try downloading the images with timeout and error handling
        response_flat = requests.get(url_flat, timeout=5)
        response_host = requests.get(url_host, timeout=5)

        response_flat.raise_for_status()
        response_host.raise_for_status()

        # Open, resize, normalize
        img_flat = Image.open(BytesIO(response_flat.content)).convert("RGB").resize((64, 64))
        img_host = Image.open(BytesIO(response_host.content)).convert("RGB").resize((64, 64))

        img_flat = np.array(img_flat) / 255.0
        img_host = np.array(img_host) / 255.0

        # Final shape check
        if img_flat.shape != (64, 64, 3) or img_host.shape != (64, 64, 3):
            raise ValueError("Unexpected image shape")

        # Append processed images
        pictures_flat.append(img_flat)
        pictures_host.append(img_host)

    except Exception as e:
        # Log error and mark index for dropping
        print(f"Skipping index {i} due to error: {e}")
        drop_indices.append(i)

# Drop bad rows from DataFrame
train_df = train_df.drop(index=drop_indices).reset_index(drop=True)
train_df.to_pkl("train_df.pkl")


In [8]:
train_df = pd.read_pickle("train_df.pkl")
pictures_host = train_df["pic_host"]
pictures_flat = train_df["pic_flat"]

In [9]:
train_images_flat = np.asarray(pictures_flat)
train_images_host = np.asarray(pictures_host)

In [None]:
np.median(train_df["price"])

# 2. Data Exploration

First we take a closer look at the price distribution of the given data.
We can clearly see that we have a wide range of prices and very few observations at the very top of the price range.
The average airbnb apartment is rented for roughly 100$ per night. The median lies is a little lower.

In [None]:
sns.set(rc={"figure.figsize": (15, 5)})
fig = plt.figure()
sns.histplot(data=train_df, x="price", bins=50)
plt.axvline(train_df["price"].mean(), c="red", ls="-", lw=3, label="Mean Price")
plt.axvline(train_df["price"].median(), c="blue", ls="-", lw=3, label="Median Price")
plt.title("Distribution of Airbnb Prices", fontsize=20, fontweight="bold")
plt.legend()
plt.show()

Sometimes it might be advisable to take a closer look at the log-distribution of the prices, which, often depending on the chosen sample size, looks pretty much like a normal distribution.

In [None]:
sns.set(rc={"figure.figsize": (15, 5)})
fig = plt.figure()
sns.histplot(data=train_df, x="log_price", bins=50)
plt.axvline(train_df["log_price"].mean(), c="red", ls="-", lw=3, label="Mean Log Price")
plt.axvline(
    train_df["log_price"].median(),
    c="blue",
    ls="-",
    lw=3,
    label="Median Log Price",
)
plt.title("Distribution of Airbnb Log Prices", fontsize=20, fontweight="bold")
plt.legend()
plt.show()

Additionaly we take a closer look at some of the variables which are chosen at random, although I do expect that the chosen 3 variables, namely the room type, whether the host is a so-called "superhost" and the number of people the apartment accomodates has a significant impact on the rental price. However, this is by no means validated and thus you should not be too strongly influenced by my choice.

In [None]:
feature_variables = [
    "room_type",
    "host_is_superhost",
    "accommodates",
]

# for each of the above listed feature variables
# show a boxplot and distribution plot against the log price
for variable in feature_variables:
    fig, ax = plt.subplots(1, 2)
    sns.boxplot(data=train_df, x=variable, y="log_price", ax=ax[0])
    sns.histplot(train_df, x="log_price", hue=variable, kde=True, ax=ax[1])
    plt.suptitle(variable, fontsize=20, fontweight="bold")
    fig.show()

Now, we take a look at the host images and the flat images. For this, we write a simple function that plots the stored images and labels them with the given apartment rental price. 

In [14]:
def plot_images_and_prices(
    df, desired_price=None, num_images=5, random_state=202, pics="pic_host"
):
    """plots images and respective prices

    Args:
        df (pd.DataFrame): DataFrame eveyrthing is stored in. Prices and images as arrays
        desired_price (float, optional): If you want to only view images with a specified rental price. Defaults to None.
        num_images (int, optional): Number of images to be plotted. Defaults to 5.
        random_state (int, optional): . Defaults to 101.
        pics (str, optional): column Name in df where images are stored. Defaults to "pic_host".
    """

    num_images = num_images

    # set the rample state for the sampling for reproducibility
    random_state = random_state

    # only select entries with given price when specified
    if desired_price != None:
        random_sample = (
            df[df["price"] == desired_price]
            .sample(num_images, random_state=random_state)
            .reset_index(drop=True)
        )
    else:
        random_sample = df.sample(num_images, random_state=random_state)

    for i in range(num_images):

        price = random_sample.iloc[i]["price"]
        plt.subplot(1, num_images, i + 1)

        title = price
        plt.title(title)

        # turn off gridlines
        plt.axis("off")

        plt.imshow(random_sample.iloc[i][pics])
    plt.show()
    plt.close()

In [None]:
# plot the hosts
plot_images_and_prices(train_df)

In [None]:
# plot the flats
plot_images_and_prices(train_df, pics="pic_flat")

# 3. Model building

To create a model, simpler models are first examined. This can be used, for example, to select the appropriate variables, since simple models often offer good interpretability. 

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
import matplotlib.patches as mpatches
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
from math import e

First, we select only a few of the variables for easier computation. An adewuate variable selection will be of crucial importance for your model, however, for this simple introduction I just picked what I thought might be reasonable. Additionaly, I chose variables that might be correlated with one another as bedrooms and beds, but did not adjust for that in my modeling as you might see later.

We choose the log-price as Y, as we will start with a simple linear regression where the normality assumption is crucial.

As the chosen feature variables we selected:

* "host_is_superhost"
* "latitude"
* "longitude"
* "room_type"
* "accommodates"
* "bedrooms"
* "number_of_reviews"
* "review_scores_value"


Most of these are self explanatory, but you can look up their exact meaning at: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896

In [18]:
Y = train_df["log_price"]

train_df = train_df[
    [
        "host_is_superhost",
        "latitude",
        "longitude",
        "room_type",
        "accommodates",
        "bedrooms",
        "minimum_nights",
        "number_of_reviews",
        "review_scores_value",
        "host_identity_verified",
    ]
]

There are several ways to deal with missing data. The obvious one is to simply drop all rows where you encounter any missing (or unreasonable, e.g. 10 billion beds) values. Sometimes this might even be advisable. However, I chose to use a fairly straight forward method by simply replacing the missing values with the respective median values of the features. I chose the median, as the mean of e.g. number of bedrooms is not really meaningful, with e.g. 1.19.

Additionally we dummy decode the column "room_type", hus creating additional columns.

In [None]:
train_df.isna().sum() # When considering only the first 150 images, there are not so many missing values. Data processing will get important when using the full dataset.

In [20]:
imputer = SimpleImputer(strategy="median")


imputer = imputer.fit(
    train_df[
        [
            "accommodates",
            "bedrooms",
            "number_of_reviews",
            "review_scores_value",
            "minimum_nights",
        ]
    ]
)

train_df[
    [
        "accommodates",
        "bedrooms",
        "number_of_reviews",
        "review_scores_value",
        "minimum_nights",
    ]
] = imputer.transform(
    train_df[
        [
            "accommodates",
            "bedrooms",
            "number_of_reviews",
            "review_scores_value",
            "minimum_nights",
        ]
    ]
)

In [None]:
rooms = pd.get_dummies(train_df["room_type"], prefix="room")
train_df = train_df.drop("room_type", axis=1)
train_df["host_is_superhost"] = train_df["host_is_superhost"].map(dict(t=1, f=0))
train_df["host_identity_verified"] = train_df["host_identity_verified"].map(
    dict(t=1, f=0)
)
train_df = pd.concat([train_df, rooms], axis=1)
train_df

In [None]:
train_df.isna().sum()

For a visual interpretation of our following model results we define a simple plotting function that plots the actual prices vs our predicted prices

In [23]:
# let's see what our predictions look like vs the actual
def ActualvPredictionsGraph(test_y, pred_y, title, prob=False):
    if max(test_y) >= max(pred_y):
        my_range = int(max(test_y))
    else:
        my_range = int(max(pred_y))
    plt.figure(figsize=(12, 3))
    plt.scatter(range(len(test_y)), test_y, color="blue")
    plt.scatter(range(len(pred_y)), pred_y, color="red")
    plt.xlabel("Index ")
    plt.ylabel("Log_price")
    plt.title(title, fontdict={"fontsize": 15})
    plt.legend(
        handles=[
            mpatches.Patch(color="red", label="prediction"),
            mpatches.Patch(color="blue", label="actual"),
        ]
    )
    plt.show()

    if prob:
        # plot actual v predicted in histogram form
        plt.figure(figsize=(12, 4))
        sns.histplot(predict_test, color="r", alpha=0.3, stat="probability", kde=True)
        sns.histplot(test_y, color="b", alpha=0.3, stat="probability", kde=True)
        plt.legend(labels=["prediction", "actual"])
        plt.title("Actual v Predict Distribution" + str(title))
        plt.show()

Before we start modelling, we need to agree on an evaluation method. Obviously for a simple linear regression we could use the R^2, adjusted R2 or AIC metric, but as later we are introducing neural networks I want to use a simple and straight forward method of evluating the model predictions. For that purpose we will use the Root Mean Square Error (RMSE) metric.

\begin{equation}
RMSE = \sqrt {\sum_{i=1}^{n} \frac{\hat{y_i} - y_i}{n}},
\end{equation}

where $\hat{y}$ is the predicted price. Be aware that the imported function is only the Mean Squared Error (MSE) which is why we take the squareroot by using **0.5


For that purpose we will split our data set into a training data set and a testing data set.

In [24]:
train_x, test_x, train_y, test_y = train_test_split(
    train_df, Y, test_size=0.25, random_state=0
)

First, lets try a "stupid model" that only predicts the mean of the training data, in our case the mean log price of the training data. 

In [25]:
def most_basic_model_prediction(y_train, y_test):
    a = np.empty(len(test_y))
    a.fill(np.mean(train_y))
    return a


pred_y = most_basic_model_prediction(train_y, test_y)

In [26]:
print("RMSE on test data: ", mean_squared_error(test_y, pred_y) ** (0.5))

RMSE on test data:  0.5278642627476925


In [None]:
ActualvPredictionsGraph(test_y, pred_y, "Actual v. Mean")

## Linear Regression Model

You should all be familiar with the simple linear regression model. 
\begin{equation}
\mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}
\end{equation}
However, you might not have fitted a regression using python. 

To use a linear regression in python is similarly easy as it is using it in R. Unfortunately, sklearn does not offer a summary as nice as the R-version does. If you are interested in a similar regression output you could use e.g. statsmodels or create the output yourself by e.g. using sklearns metrics as the introduced MSE.

In [None]:
model_LR = LinearRegression()

# fit the model with the training data
model_LR.fit(train_x, train_y)

predict_train = model_LR.predict(train_x)
predict_test = model_LR.predict(test_x)
# Root Mean Squared Error on train and test date

print("RMSE on train data: ", mean_squared_error(train_y, predict_train) ** (0.5))
print("RMSE on test data: ", mean_squared_error(test_y, predict_test) ** (0.5))

In [None]:
# plot it
ActualvPredictionsGraph(test_y, predict_test, "Actual v. Predicted LM", prob=True)

Decision trees are algorithms that can be used in both, regression and classification. They are basically using a set of binary rules and are thus very easy to interpret. A decision tree, just as a real tree has subsequently branches, nodes and leaves. See [here](https://www.datacamp.com/tutorial/decision-tree-classification-python) to get the basic idea of a decision tree if you want.

In [30]:
tree_reg = DecisionTreeRegressor(max_depth=3, min_samples_split=5)

# train the model
tree_reg.fit(train_x, train_y)

# predict the response for the test data
predict_train = tree_reg.predict(train_x)
predict_test = tree_reg.predict(test_x)

# Root Mean Squared Error on train and test date
print("RMSE on train data: ", mean_squared_error(train_y, predict_train) ** (0.5))
print("RMSE on test data: ", mean_squared_error(test_y, predict_test) ** (0.5))

RMSE on train data:  0.41953133106249946
RMSE on test data:  0.46540733261356293


In [None]:
ActualvPredictionsGraph(test_y, predict_test, "Actual v. Predicted DT", prob=True)

In [None]:
# visualize the decision tree
fig = plt.figure(figsize=(15, 5))
plot = tree.plot_tree(
    tree_reg, feature_names=train_x.columns.values.tolist(), filled=True
)

The overall performance of the decision trees is fairly similar to the performance of a simple linear regression. However, wee see an accumulation of predictions around certain prices due to the inherent model structure of binary decisions.

A similar and very common approach is a random forest regression:

In [33]:
model_RFR = RandomForestRegressor(max_depth=10)

# fit the model with the training data
model_RFR.fit(train_x, train_y)

# predict the target on train and test data
predict_train = model_RFR.predict(train_x)
predict_test = model_RFR.predict(test_x)

# Root Mean Squared Error on train and test date
print("RMSE on train data: ", mean_squared_error(train_y, predict_train) ** (0.5))
print("RMSE on test data: ", mean_squared_error(test_y, predict_test) ** (0.5))

RMSE on train data:  0.17320630562633219
RMSE on test data:  0.39184350231160225


In [None]:
ActualvPredictionsGraph(test_y, predict_test, "Actual v. Predicted RF", prob=True)


In [None]:
plt.figure(figsize=(10, 7))
feat_importances = pd.Series(model_RFR.feature_importances_, index=train_x.columns)
feat_importances.nlargest(train_df.shape[1]).plot(kind="barh")

This plot gives the "importance" of our input variables/features in the model. This could e.g. give us a reason to drop some of our used variables.

Now we want to build our first Multi Layer Perceptron. In this Notebook I will be using firstly the very inflexible MLPRegressor from scikit-learn and secondly a simple Pytorch Implementation. You are, however, by no means in any way restricted and can use any other library you find suitable.

Commonly the weights in neural networks are initialized to random small numbers (most often with a random draw from a truncated normal distribution with mean=0 and variance=1) and updated with a given (and often changing/flexible) small learning rate.
Given this use of small weights the scale of inputs (and outputs) is an important factor. Unscaled input variables can result in a slow or unstable learning process which in turn can lead to a bad performance. Unscaled output variables (in our case the price) can lead to exploding gradients.

As we have already scaled our prices with a log-transformation we will apply a MinMax Scaling approach to our input variables.



\begin{equation}
x_{scaled} = \frac{x - min(x)}{max(x) - min(x)}
\end{equation}


As you can see in the formula this has no effect on the dummy variables. However, our implementation has an effect on the "categorical" variables as e.g. "accommodates" or "beds". It might therefore maybe be more approriate to use e.g. One-Hot-Encoding for the categorical variables. I have not implemented this in the present notebook as this is purely introductory, so you should definetly have in mind that the variables selection, feature extraction and preprocessing is a very important part of your model building and should not simply be copied from this notebook 

In [36]:
# try with minmax scaled data -> also usefull for MLP
mn = MinMaxScaler()
x_train_scaled = pd.DataFrame(mn.fit_transform(train_x), columns=train_x.columns)
x_test_scaled = pd.DataFrame(mn.fit_transform(test_x), columns=test_x.columns)

In [37]:
MLPreg = MLPRegressor(
    hidden_layer_sizes=(10,10,10),
    activation="relu",
    random_state=1,
    max_iter=10000,
).fit(x_train_scaled, train_y)

In [38]:
predict_train = MLPreg.predict(x_train_scaled)
predict_test = MLPreg.predict(x_test_scaled)
# Root Mean Squared Error on train and test data
print("RMSE on train data: ", mean_squared_error(train_y, predict_train) ** (0.5))
print("RMSE on test data: ", mean_squared_error(test_y, predict_test) ** (0.5))

RMSE on train data:  0.4325506982942492
RMSE on test data:  0.44385867380905664


In [None]:
ActualvPredictionsGraph(test_y, predict_test, "Actual v. Predicted MLP", prob=True)

Now let's implement a simple MLP with PyTorch

In [40]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error


In [41]:
# First define the validation split, number of epochs, and the batch size
split = 0.2
epochs = 2000
batch_size = 64

# Convert pandas DataFrames/Series to numpy arrays first
x_train_tensor = torch.tensor(x_train_scaled.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(train_y.to_numpy(), dtype=torch.float32).view(-1, 1)
x_test_tensor = torch.tensor(x_test_scaled.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(test_y.to_numpy(), dtype=torch.float32).view(-1, 1)

# Create dataset and dataloaders
full_dataset = TensorDataset(x_train_tensor, y_train_tensor)
val_size = int(len(full_dataset) * split)
train_size = len(full_dataset) - val_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)


In [42]:
# Define a super simple model with just 1 layer

class SimpleMLP(nn.Module):
    def __init__(self, input_dim):
        super(SimpleMLP, self).__init__()
        self.fc = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.fc(x)

model = SimpleMLP(x_train_scaled.shape[1])
print(model)


SimpleMLP(
  (fc): Linear(in_features=12, out_features=1, bias=True)
)


In [43]:
# Define loss function and optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters())


In [None]:
# Training loop
train_losses = []
val_losses = []

for epoch in range(epochs):
    model.train()
    epoch_train_loss = 0
    for xb, yb in train_loader:
        pred = model(xb)
        loss = criterion(pred, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss += loss.item() * xb.size(0)

    train_losses.append(epoch_train_loss / train_size)

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() * xb.size(0) for xb, yb in val_loader)
        val_losses.append(val_loss / val_size)

    if (epoch + 1) % 20 == 0 or epoch == 0:
        print(f"Epoch {epoch+1}: Train Loss = {train_losses[-1]:.4f}, Val Loss = {val_losses[-1]:.4f}")


In [None]:
# Step 4.2: Plot the learning curve
plt.plot(train_losses, label="train")
plt.plot(val_losses, label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()


In [None]:
# Evaluate on test data
model.eval()
with torch.no_grad():
    predict_train = model(x_train_tensor).numpy()
    predict_test = model(x_test_tensor).numpy()

print("RMSE on train data: ", mean_squared_error(train_y, predict_train) ** 0.5)
print("RMSE on test data: ", mean_squared_error(test_y, predict_test) ** 0.5)


# Activation Functions

For this implementation, we will use the standard Rectified Linear Uni activation function. However there are obviously multiple other alternatives which could be used.

### Rectified Linear Activation Function:

\begin{equation}
f (x) = \left\{
\begin{array}{ll}
x & x > 0 \\
0 & \, \textrm{else} \\
\end{array}
\right.
\end{equation}

The ReLU activation function is pretty simple. It is 0, once x is smaller or equal to zero and x, whenever x is bigger than zero. We could write this function in a loop simply as:
```
outputs = []
for value in inputs:
    if value > 0:
        outputs.append(x)
    elif value <= 0:
        outputs.append(0)    
```

However, this can easily be done in one line, by simply taking the maximum between the value itself and 0:

```
np.maximum(0, value)
```

The derivative of the ReLU function is equally as easy and simply not defined for x = 0:

\begin{equation}
f (x) = \left\{
\begin{array}{ll}
1 & x > 0 \\
0 & x < 0 \\
\end{array}
\right.
\end{equation}

In [47]:
def ReLU(inputs):
    return np.maximum(0, inputs)

def linear(inputs):
    return inputs


In [None]:
x = np.linspace(-10, 10, 500)
plt.plot(x, ReLU(x))
plt.title("Activation Function :ReLu")
plt.show()

More Complex MLP with PyTorch


In [49]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import numpy as np


In [50]:
# Hyperparameters
split = 0.2
epochs = 2000
batch_size = 64
learning_rate = 0.001
early_stopping_patience = 4

In [51]:
# Convert pandas DataFrames/Series to PyTorch tensors
x_train_tensor = torch.tensor(x_train_scaled.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(train_y.to_numpy(), dtype=torch.float32).view(-1, 1)
x_test_tensor = torch.tensor(x_test_scaled.to_numpy(), dtype=torch.float32)
y_test_tensor = torch.tensor(test_y.to_numpy(), dtype=torch.float32).view(-1, 1)

# Dataset and DataLoader
full_dataset = TensorDataset(x_train_tensor, y_train_tensor)
val_size = int(len(full_dataset) * split)
train_size = len(full_dataset) - val_size
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)


In [None]:
# Complex MLP with ReLU activations and optional dropout
class ComplexMLP(nn.Module):
    def __init__(self, input_dim, use_dropout=False):
        super().__init__()
        layers = [
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU()
        ]
        
        if use_dropout:
            layers.append(nn.Dropout(0.5))
        
        layers += [
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        ]

        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

# Instantiate the model
model = ComplexMLP(x_train_tensor.shape[1], use_dropout=False)
print(model)


In [53]:
# Training setup
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


In [None]:
# Training with manual early stopping
train_losses = []
val_losses = []
best_val_loss = float('inf')
patience_counter = 0

for epoch in range(epochs):
    model.train()
    epoch_train_loss = 0
    for xb, yb in train_loader:
        pred = model(xb)
        loss = criterion(pred, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss += loss.item() * xb.size(0)

    epoch_train_loss /= train_size
    train_losses.append(epoch_train_loss)

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() * xb.size(0) for xb, yb in val_loader)
        val_loss /= val_size
        val_losses.append(val_loss)

    print(f"Epoch {epoch+1}: Train Loss = {epoch_train_loss:.4f}, Val Loss = {val_loss:.4f}")

    # Early stopping logic
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        best_model_state = model.state_dict()
    else:
        patience_counter += 1
        if patience_counter >= early_stopping_patience:
            print(f"Stopping early at epoch {epoch+1}")
            break

# Load best model
model.load_state_dict(best_model_state)


In [None]:
# Plot the learning curve
plt.plot(train_losses, label="train")
plt.plot(val_losses, label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()


In [None]:
# Evaluate on train and test data
model.eval()
with torch.no_grad():
    predict_train = model(x_train_tensor).numpy()
    predict_test = model(x_test_tensor).numpy()

print("RMSE on train data: ", mean_squared_error(train_y, predict_train) ** 0.5)
print("RMSE on test data: ", mean_squared_error(test_y, predict_test) ** 0.5)


Using Dropout to Mitigate Overfitting


In [None]:
# Reinitialize model with dropout and retrain
model = ComplexMLP(x_train_tensor.shape[1], use_dropout=True)
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

train_losses = []
val_losses = []
best_val_loss = float('inf')
patience_counter = 0

for epoch in range(epochs):
    model.train()
    epoch_train_loss = 0
    for xb, yb in train_loader:
        pred = model(xb)
        loss = criterion(pred, yb)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_train_loss += loss.item() * xb.size(0)

    epoch_train_loss /= train_size
    train_losses.append(epoch_train_loss)

    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(xb), yb).item() * xb.size(0) for xb, yb in val_loader)
        val_loss /= val_size
        val_losses.append(val_loss)

    print(f"Epoch {epoch+1}: Train Loss = {epoch_train_loss:.4f}, Val Loss = {val_loss:.4f}")

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_counter = 0
        best_model_state = model.state_dict()
    else:
        patience_counter += 1
        if patience_counter >= early_stopping_patience:
            print(f"Stopping early at epoch {epoch+1}")
            break

model.load_state_dict(best_model_state)


In [None]:
# Plot the learning curve with dropout
plt.plot(train_losses, label="train")
plt.plot(val_losses, label="val")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()


In [59]:
# Evaluate with dropout model
model.eval()
with torch.no_grad():
    predict_train = model(x_train_tensor).numpy()
    predict_test = model(x_test_tensor).numpy()

print("RMSE on train data (dropout): ", mean_squared_error(train_y, predict_train) ** 0.5)
print("RMSE on test data (dropout): ", mean_squared_error(test_y, predict_test) ** 0.5)


RMSE on train data (dropout):  0.46103065079775535
RMSE on test data (dropout):  0.45557599217372446


# 4. Model building with images

Now lets first try a simple regression simply with the images of the flats. Spoiler: The results are really bad (worse than mean prediction). So this is really just a demonstration of how such models roughly look. 

In [60]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


In [61]:
# Split image data
img_train_x, img_test_x, img_train_y, img_test_y = train_test_split(
    train_images_flat, Y, test_size=0.25, random_state=0
)

# Split image data
img_train_x, img_test_x, img_train_y, img_test_y = train_test_split(
    train_images_flat, Y, test_size=0.25, random_state=0
)

# Fix: convert object-type arrays to uniform float32 arrays
img_train_x = np.stack(img_train_x).astype(np.float32)
img_test_x = np.stack(img_test_x).astype(np.float32)

# Convert image arrays from NHWC to NCHW for PyTorch
img_train_x = torch.tensor(img_train_x).permute(0, 3, 1, 2)
img_test_x = torch.tensor(img_test_x).permute(0, 3, 1, 2)

# Convert labels to torch tensors
img_train_y = torch.tensor(img_train_y.to_numpy(), dtype=torch.float32).view(-1, 1)
img_test_y = torch.tensor(img_test_y.to_numpy(), dtype=torch.float32).view(-1, 1)

# Create datasets and dataloaders
train_ds = TensorDataset(img_train_x, img_train_y)
test_ds = TensorDataset(img_test_x, img_test_y)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size=batch_size)



In [62]:
# CNN Model
class CNNRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),

            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),

            nn.Conv2d(128, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),

            nn.Dropout(0.5),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
        )
        self.flatten = nn.Flatten()
        self.fc = nn.Sequential(
            nn.Linear(64 * 2 * 2, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, x):
        x = self.conv(x)
        x = self.flatten(x)
        x = self.fc(x)
        return x

cnn_model = CNNRegressor()


In [None]:
# Train CNN
def train_model(model, train_dl, val_dl=None, epochs=100):
    opt = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.MSELoss()
    train_loss, val_loss = [], []

    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        for xb, yb in train_dl:
            preds = model(xb)
            loss = loss_fn(preds, yb)
            opt.zero_grad()
            loss.backward()
            opt.step()
            epoch_loss += loss.item() * xb.size(0)

        train_loss.append(epoch_loss / len(train_dl.dataset))

        if val_dl:
            model.eval()
            val_epoch_loss = 0
            with torch.no_grad():
                for xb, yb in val_dl:
                    val_epoch_loss += loss_fn(model(xb), yb).item() * xb.size(0)
            val_loss.append(val_epoch_loss / len(val_dl.dataset))

        print(f"Epoch {epoch+1}: train loss = {train_loss[-1]:.4f}")

    return train_loss, val_loss

cnn_model = CNNRegressor()
train_losses, _ = train_model(cnn_model, train_dl, epochs=100)


In [None]:
# Evaluate CNN
cnn_model.eval()
with torch.no_grad():
    pred_train = cnn_model(img_train_x).numpy()
    pred_test = cnn_model(img_test_x).numpy()

print("RMSE train:", mean_squared_error(img_train_y, pred_train) ** 0.5)
print("RMSE test:", mean_squared_error(img_test_y, pred_test) ** 0.5)

plt.plot(train_losses, label="train")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()


As this is just an introductory notebook, I again want to emphasize that the performance of these models is arbitrary and should not effect your model or variable choice. I did not try to get these models to work. I just wanted to showcase you some stuff.

In this notebook we have started with very simple methods, suited to analyse structured data. Even in the performed models any form of optimizing is missing and the variable selection was done at random. Therefore you have plenty of room to build better fitting models using more suitable variables.

Subsequently we have used 2 types of neural networks:
* Multi Layer Perceptrons to analyze structured data
* Convolutional Neural networks to analyze images

