# Comprehensive Analysis of Rental Property Data Using Linear Regression

This notebook uses linear regression to explore what drives rental prices. The dataset comprises rental property listings from 2018 to 2023, with features such as location, size, amenities, and price. The goal is to identify significant predictors of rental prices and to quantify how individual property features affect price, giving investors, real estate professionals, and policymakers a basis for informed decisions.

In [None]:
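# Load necessary libraries
library(tidyverse)
library(lubridate)  # provides year(); attached explicitly because tidyverse versions before 2.0.0 do not attach it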

# Reading the dataset
data_rent <- read.csv("./rent_merge_2018_2023.csv", dec = ",", header = TRUE, sep = ";")
# Convert categorical variables to factors (dummy variables)
categorical_vars <- c('gym', 'field_quadra', 'elevator', 'furnished', 'swimming_pool')
data_rent[categorical_vars] <- lapply(data_rent[categorical_vars], as.factor)

# Quick summary to check the conversion
summary(data_rent[categorical_vars])

### Dataset Overview

The dataset, `rent_merge_2018_2023.csv`, includes detailed listings of rental properties. Key variables include floor area in square meters (`area_m2`), number of bedrooms and bathrooms, presence of amenities (gym, swimming pool, furnished status), and monthly rental price (`price_real_month`). We preprocess the data so that categorical variables are handled appropriately and the data is ready for linear regression analysis.

In [None]:
# The file is semicolon-separated and uses a comma as the decimal mark
data_rent <- read.csv("./rent_merge_2018_2023.csv", dec = ",", header = TRUE, sep = ";")

### Preprocessing: Handling Categorical Variables

Before fitting the model, categorical variables must be converted into a form that `lm()` can use. Amenities such as the gym, swimming pool, and furnished status are recorded as 0/1 indicators of absence or presence. Converting them to factors makes R treat them as categories rather than as numbers, so each coefficient measures the price difference associated with having the amenity.
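
To make the dummy coding concrete, the sketch below shows the design matrix R builds from a factor. The `toy` data frame and its values are invented for illustration and are not taken from the rental dataset.

```r
# Toy data, invented for illustration (not from rent_merge_2018_2023.csv)
toy <- data.frame(
  price = c(1500, 2200, 1800, 2600),
  gym   = factor(c(0, 1, 0, 1))
)

# model.matrix() returns the design matrix that lm() would use:
# an intercept column plus one 0/1 column for the non-reference level
X <- model.matrix(price ~ gym, data = toy)
print(X)

# The reference category is the first factor level ("0" here)
print(levels(toy$gym))
```

The column named `gym1` is the dummy variable; its coefficient in a fitted model would be the estimated difference relative to the reference level `"0"`.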

In [None]:
data_rent$gym <- as.factor(data_rent$gym)
data_rent$field_quadra <- as.factor(data_rent$field_quadra)
data_rent$elevator <- as.factor(data_rent$elevator)
data_rent$furnished <- as.factor(data_rent$furnished)
data_rent$swimming_pool <- as.factor(data_rent$swimming_pool)

### Parsing the date column

In [None]:
data_rent$date <- as.Date(data_rent$date, format = "%d/%m/%Y")
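
A quick way to check that the format string matches the data is to parse a few values and count the failures. The sketch below uses made-up strings; the third one is deliberately in a different layout to show that mismatches become `NA` rather than raising an error.

```r
# Made-up date strings; the last one does not follow day/month/year
raw <- c("15/03/2018", "01/12/2023", "2018-03-15")

parsed <- as.Date(raw, format = "%d/%m/%Y")
print(parsed)

# Strings that do not match the format are returned as NA
n_failed <- sum(is.na(parsed))
print(n_failed)

# The year can be extracted with base R as well as with lubridate::year()
print(format(parsed, "%Y"))
```

Running the same `is.na()` count on the real `date` column after conversion would reveal rows whose dates did not follow the assumed layout.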

### Filtering the years

In [None]:
data_2018 <- filter(data_rent, year(date) == 2018)

In [None]:
data_2019 <- filter(data_rent, year(date) == 2019)

In [None]:
data_2020 <- filter(data_rent, year(date) == 2020)

In [None]:
data_2021 <- filter(data_rent, year(date) == 2021)

In [None]:
data_2022 <- filter(data_rent, year(date) == 2022)

In [None]:
data_2023 <- filter(data_rent, year(date) == 2023)
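
As an alternative to one `filter()` call per year, the same partition can be sketched with base R's `split()`, which returns a named list with one data frame per year. The `toy` data frame below is invented for illustration; the notebook itself keeps the separate `data_2018` ... `data_2023` objects.

```r
# Invented example rows (not from the rental dataset)
toy <- data.frame(
  date  = as.Date(c("2018-05-01", "2018-09-10", "2019-01-20", "2023-07-04")),
  price = c(1500, 1600, 1700, 2400)
)

# Split into a list keyed by the four-digit year of each date
by_year <- split(toy, format(toy$date, "%Y"))

print(names(by_year))          # one element per year present in the data
print(nrow(by_year[["2018"]])) # number of rows that fall in 2018
```

A list keyed by year is convenient when the same operation has to be repeated for every year, since it can be passed directly to `lapply()`.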

### Multiple regression by year

In [None]:
reg <- lm(price_real_month ~ area_m2 +
  bedrooms +
  suite +
  bathrooms +
  garage +
  condo_real +
  metro_dist_km +
  gym +
  field_quadra +
  elevator +
  furnished +
  swimming_pool, data = data_rent)
summary(reg)

In [None]:
reg_2018 <- lm(price_real_month ~ area_m2 +
  bedrooms +
  suite +
  bathrooms +
  garage +
  condo_real +
  metro_dist_km +
  gym +
  field_quadra +
  elevator +
  furnished +
  swimming_pool, data = data_2018)
summary(reg_2018)

In [None]:
reg_2019 <- lm(price_real_month ~ area_m2 +
  bedrooms +
  suite +
  bathrooms +
  garage +
  condo_real +
  metro_dist_km +
  gym +
  field_quadra +
  elevator +
  furnished +
  swimming_pool, data = data_2019)
summary(reg_2019)

In [None]:
reg_2020 <- lm(price_real_month ~ area_m2 +
  bedrooms +
  suite +
  bathrooms +
  garage +
  condo_real +
  metro_dist_km +
  gym +
  field_quadra +
  elevator +
  furnished +
  swimming_pool, data = data_2020)
summary(reg_2020)

In [None]:
reg_2021 <- lm(price_real_month ~ area_m2 +
  bedrooms +
  suite +
  bathrooms +
  garage +
  condo_real +
  metro_dist_km +
  gym +
  field_quadra +
  elevator +
  furnished +
  swimming_pool, data = data_2021)
summary(reg_2021)

In [None]:
reg_2022 <- lm(price_real_month ~ area_m2 +
  bedrooms +
  suite +
  bathrooms +
  garage +
  condo_real +
  metro_dist_km +
  gym +
  field_quadra +
  elevator +
  furnished +
  swimming_pool, data = data_2022)
summary(reg_2022)
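
Because every yearly model uses the same formula, the fits can also be produced in a loop and their coefficients collected into one table for comparison across years. The following is a minimal sketch on simulated data: the `toy` data frame, the reduced two-predictor formula, and the coefficient values used to simulate prices are all assumptions made for illustration, not results from the rental dataset.

```r
set.seed(1)

# Simulated data: three years, 40 listings each (illustrative only)
n <- 120
toy <- data.frame(
  year    = rep(c(2018, 2019, 2020), each = n / 3),
  area_m2 = runif(n, 30, 150),
  gym     = factor(sample(0:1, n, replace = TRUE))
)
# Arbitrary data-generating coefficients, chosen only for the demo
toy$price_real_month <- 500 + 20 * toy$area_m2 +
  150 * (toy$gym == "1") + rnorm(n, sd = 100)

# One formula object shared by all yearly models
model_formula <- price_real_month ~ area_m2 + gym

# Fit the same model separately on each year's subset
fits <- lapply(split(toy, toy$year), function(d) lm(model_formula, data = d))

# One row per year, one column per coefficient
coef_table <- t(sapply(fits, coef))
print(round(coef_table, 2))
```

Storing the formula once avoids copy-and-paste drift between the yearly models, and `summary(fits[["2019"]])` would give the full output for a single year.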

### Plotting the data after removing outliers with the IQR rule

In [None]:
# Calculate the interquartile range (IQR) for both area_m2 and price_real_month
data <- data_rent
Q1_area <- quantile(data$area_m2, 0.25)
Q3_area <- quantile(data$area_m2, 0.75)
IQR_area <- Q3_area - Q1_area

Q1_price <- quantile(data$price_real_month, 0.25)
Q3_price <- quantile(data$price_real_month, 0.75)
IQR_price <- Q3_price - Q1_price

# Define the upper and lower bounds for outliers
upper_bound_area <- Q3_area + 1.5 * IQR_area
lower_bound_area <- Q1_area - 1.5 * IQR_area

upper_bound_price <- Q3_price + 1.5 * IQR_price
lower_bound_price <- Q1_price - 1.5 * IQR_price

# Filter out the outliers in both area_m2 and price_real_month
data_filtered <- subset(data, area_m2 >= lower_bound_area &
  area_m2 <= upper_bound_area &
  price_real_month >= lower_bound_price &
  price_real_month <= upper_bound_price)

# Set the limits for the x-axis and y-axis based on filtered data
x_min <- min(data_filtered$area_m2)
x_max <- max(data_filtered$area_m2)
y_min <- min(data_filtered$price_real_month)
y_max <- max(data_filtered$price_real_month)

# Generate the plot with filtered data and adjusted x-axis and y-axis limits
ggplot(data_filtered, aes(x = area_m2, y = price_real_month)) +
  geom_point() +
  geom_smooth(method = "auto", se = FALSE) +
  labs(title = "Scatterplot with Automatic Trend Curve",
       x = "Area (m^2)",
       y = "Price (R$ per month)") +
  xlim(x_min, x_max) +
  ylim(y_min, y_max)
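
The IQR rule used above keeps values between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The sketch below wraps that rule in a small helper and checks it on a made-up vector; `iqr_keep` is a hypothetical name introduced here, and the `na.rm = TRUE` handling of missing values is an added assumption that the cell above does not make.

```r
# Hypothetical helper: TRUE for values inside [Q1 - k*IQR, Q3 + k*IQR]
iqr_keep <- function(x, k = 1.5) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  spread <- q[2] - q[1]
  # Missing values are flagged FALSE so they are dropped with the outliers
  !is.na(x) & x >= q[1] - k * spread & x <= q[2] + k * spread
}

# Made-up areas with one extreme value
toy_area <- c(40, 45, 50, 55, 60, 65, 70, 1000)

keep <- iqr_keep(toy_area)
print(toy_area[keep])  # values retained by the rule
print(sum(!keep))      # number of values flagged as outliers
```

With a helper like this, the filter on two columns could be written as `data[iqr_keep(data$area_m2) & iqr_keep(data$price_real_month), ]`.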