# Problem Statement

**PROJECT 1** <br>
**Exploratory analysis and predictive modeling of housing prices in Barcelona using Python and SQL**

## Objective
Develop a complete analysis and a predictive model for housing prices in Barcelona, using data extracted from the Fotocasa portal. The goal is to apply data extraction, manipulation and analysis techniques, together with Machine Learning algorithms, to predict housing prices from a range of property characteristics.

## Data Description
- **price**: Price of the property.
- **rooms**: Number of rooms.
- **bathroom**: Number of bathrooms.
- **lift**: Whether the building has a lift (elevator).
- **terrace**: Whether the property has a terrace.
- **square_meters**: Surface area in square meters.
- **real_state**: Type of property (real estate).
- **neighborhood**: Neighborhood where the property is located.
- **square_meters_price**: Price per square meter.

## Importing necessary libraries

In [826]:
import pandas as pd
import numpy as np

# To help with data visualization
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # data visualization
%matplotlib inline
sns.set_style('whitegrid') # set style for visualization

# To suppress warnings
import warnings # ignore warnings
warnings.filterwarnings('ignore')

#normalizing
from sklearn.preprocessing import MinMaxScaler, StandardScaler # to scale the data

# modeling
import statsmodels.api as sm # adding a constant to the independent variables
from sklearn.model_selection import train_test_split # splitting data in train and test sets
from statsmodels.stats.outliers_influence import variance_inflation_factor #To check multicollinearity

## Loading the Dataset

In [427]:
df=pd.read_csv('Barcelona_Fotocasa_HousingPrices.csv')

# Data Overview

In [None]:
df.head() # preview the first 5 rows

In [None]:
df.tail() # preview the last 5 rows

In [None]:
df.sample(20) # preview 20 random rows

In [None]:
print("There are", df.shape[0], 'rows and', df.shape[1], "columns.") # number of observations and features


In [None]:
df.dtypes # data types

In [None]:
df.info()

In [None]:
df.describe(include="all").T # statistical summary of the data.

In [None]:
# Uniques
df.nunique() # Checking for number of variations in the data


In [None]:
for i in df.columns: # Checking uniques
    print (i,": ",df[i].unique())

In [None]:
# Uniques
cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
for column in cat_cols:
    print(df[column].value_counts())
    print("-" * 50)


In [None]:
# Duplicates
df.duplicated().sum() # Checking for duplicate entries in the data

# Notes on Data Overview

- There are 8188 rows and 10 columns.
- The variable 'Unnamed: 0' is a row index and should be dropped from the data.
- Data types are consistent with the data description.
- There are missing values (NaN) in the variable 'real_state', to be replaced by "unknown".
- There are four property types in 'real_state'; the most common is "flat".
- Most units do not have a terrace.
- Most units have a lift.
- The neighborhood with the largest unit count is "Eixample".
- Unit sizes range from 10 m2 to 679 m2, with a mean of 84.61 m2.
- Prices are assumed to be monthly rents, so they are treated as EUR per month.
- Unit prices range from 320 EUR to 15000 EUR per month, with a mean of 1444 EUR per month.
- Prices per square meter range from 4.9 to 186 EUR/m2/month, with a mean of 17.7 EUR/m2/month.
- Some units are listed with zero rooms.
- The target variable for modeling is "price".
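
The price per square meter figures can be cross-checked against the other columns. The sketch below assumes that `square_meters_price` is simply `price / square_meters`, which the data description does not state explicitly. The toy rows, the helper name `check_price_per_m2` and the `tolerance` value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the Fotocasa data (values are illustrative only)
toy = pd.DataFrame({
    "price": [1000, 1500, 900],
    "square_meters": [50, 100, 45],
    "square_meters_price": [20.0, 15.0, 25.0],  # last row is deliberately inconsistent
})

def check_price_per_m2(frame, tolerance=0.01):
    """Return True for rows where the stored price per m2 matches price / square_meters."""
    derived = frame["price"] / frame["square_meters"]
    return pd.Series(
        np.isclose(derived, frame["square_meters_price"], rtol=tolerance),
        index=frame.index,
    )

consistent = check_price_per_m2(toy)
print(consistent.tolist())
```

Rows flagged `False` would be worth inspecting before trusting the summary statistics.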

# Exploratory Data Analysis (EDA)

## EDA Functions

In [439]:
def univariate_numerical(data):
    '''
    Function to generate two plots for each numerical variable
    Histplot for variable distribution
    Boxplot for statistical summary 
    '''
    # Select numerical columns
    numerical_cols = data.select_dtypes(include=[np.number]).columns
    
    # Determine the number of rows and columns
    num_vars = len(numerical_cols)
    num_cols = 4
    num_rows = int(np.ceil(num_vars * 2 / num_cols))
    
    # Create a figure with the specified size
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5))
    
    # Flatten the axes array for easy iteration
    axes = axes.flatten()
    
    # Plot each variable with a histplot and a boxplot
    for i, col in enumerate(numerical_cols):
        mean_value = data[col].mean()
        
        # Histplot with KDE
        sns.histplot(data[col], kde=True, ax=axes[i*2])
        axes[i*2].axvline(mean_value, color='r', linestyle='--')
        axes[i*2].set_title(f'Distribution of {col}')
        axes[i*2].text(mean_value, axes[i*2].get_ylim()[1]*0.8, f'Mean: {mean_value:.2f}', color='r', va='baseline', ha='left',rotation=90)
        
        # Boxplot
        sns.boxplot(y=data[col], ax=axes[i*2 + 1])
        axes[i*2 + 1].axhline(mean_value, color='r', linestyle='--')
        axes[i*2 + 1].set_title(f'Boxplot of {col}')
        axes[i*2 + 1].text(axes[i*2 + 1].get_xlim()[1]*0.8, mean_value, f'mean: {mean_value:.2f}', color='r', va='baseline', ha='right')
    
    # Hide any remaining empty subplots
    for j in range(num_vars * 2, len(axes)):
        fig.delaxes(axes[j])
    
    # Adjust layout
    plt.tight_layout()
    plt.show()

In [440]:
def univariate_categorical(data):
    '''
    Function to generate countplot for each categorical variable
    Labeled with count and percentage
    '''
    # List of categorical columns
    categorical_columns = data.select_dtypes(include=['object', 'category']).columns.tolist()
    
    # Number of columns in the grid
    num_cols = 4
    
    # Calculate the number of rows needed
    num_rows = (len(categorical_columns) + num_cols - 1) // num_cols
    
    # Create the grid
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
    axes = axes.flatten()
    
    # Plot each countplot in the grid
    for i, col in enumerate(categorical_columns):
        ax = axes[i]
        plot = sns.countplot(x=col, data=data, order=data[col].value_counts().index, ax=ax)
        ax.set_title(f'Count of {col}')
           
        # Add total count and percentage annotations
        total = len(data)
        for p in plot.patches:
            height = p.get_height()
            percentage = f'{(height / total * 100):.1f}%'
            plot.text(x=p.get_x() + p.get_width() / 2,
                      y=height + 2,
                      s=f'{height:.0f}\n({percentage})',
                      ha='center')
        
        # Limit x-axis labels to avoid overlap
        ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
    
    # Remove any empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])
    
    # Show the plot
    plt.show()


In [442]:
# Function to plot crosstab with labels
def plot_crosstab_bar_count(df, var_interest):
    '''
    Function to create a barplot of crosstab of the variable of interest vs each of the rest of categorical variables
    Labeled with counts
    '''
    # Extract categorical columns excluding the variable of interest
    cat_cols = df.select_dtypes(include=['category', 'object','bool']).columns.tolist()
    cat_cols.remove(var_interest)
    
    # Determine the grid size
    num_vars = len(cat_cols)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)

    # Create a grid of subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(5*num_cols, num_rows * 5), constrained_layout=True)
    axes = axes.flatten()  # Flatten the axes array for easy iteration

    for i, col in enumerate(cat_cols):
        # Create a crosstab
        crosstab = pd.crosstab(df[col], df[var_interest])
        
        # Plot the crosstab as a bar plot
        crosstab.plot(kind='bar', stacked=True, ax=axes[i])
        
        # Annotate counts in the middle of each bar section
        for bar in axes[i].patches:
            height = bar.get_height()
            if height > 0:
                axes[i].annotate(f'{int(height)}', 
                                 (bar.get_x() + bar.get_width() / 2, bar.get_y() + height / 2),
                                 ha='center', va='center', fontsize=10, color='black')
        
        # Add total labels at the top of each bar
        totals = crosstab.sum(axis=1)
        for j, total in enumerate(totals):
            # Use the loop value directly: totals is indexed by category label, not by position
            axes[i].annotate(f'Total: {total}', 
                             (j, total), 
                             ha='center', va='bottom', weight='bold')

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

# Usage
#plot_crosstab_bar_count(df, var_interest='var_interest')

In [548]:
def plot_crosstab_heat_perc(df, var_interest,df_name="DataFrame"):
    '''
    Function to create a heatmap of crosstab of the variable of interest vs each of the rest of categorical variables
    Labeled with counts, percentage by row, percentage by column
    '''
    # Extract categorical columns excluding the variable of interest
    cat_cols = df.select_dtypes(include=['category', 'object']).columns.tolist()
    cat_cols.remove(var_interest)
    
    # Determine the grid size
    num_vars = len(cat_cols)
    num_cols = 3  # Number of columns in the grid
    num_rows = (num_vars // num_cols) + int(num_vars % num_cols > 0)

    # Create a grid of subplots
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(6*num_cols, num_rows * 6))
    axes = axes.flatten()  # Flatten the axes array for easy iteration

    for i, col in enumerate(cat_cols):
        # Create crosstabs
        crosstab = pd.crosstab(df[col], df[var_interest])
        crosstab_perc_row = crosstab.div(crosstab.sum(axis=1), axis=0) * 100
        crosstab_perc_col = crosstab.div(crosstab.sum(axis=0), axis=1) * 100

        # Combine counts with percentages
        crosstab_combined = crosstab.astype(str) + "\n" + \
                            crosstab_perc_row.round(2).astype(str) + "%" + "\n" + \
                            crosstab_perc_col.round(2).astype(str) + "%"

        # Plot the crosstab as a heatmap
        sns.heatmap(crosstab, annot=crosstab_combined, fmt='', cmap='Blues', ax=axes[i], cbar=False, annot_kws={"size": 8})
        axes[i].set_title(f'Crosstab of {col} and {var_interest} - {df_name}', fontsize=12)

    # Hide any remaining empty subplots
    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    # Adjust layout to prevent label overlapping
    plt.subplots_adjust(hspace=0.4, wspace=0.4)  # Add more space between subplots
    plt.tight_layout()
    plt.show()
    
# Usage
#plot_crosstab_heat_perc(df, var_interest='var_interest')

In [532]:
def boxplot_by_group(df, group, var, outliers, df_name="DataFrame"):
    '''
    boxplot for a numerical variable of interest vs a categorical variable
    with or without outliers
    includes data mean and mean by category
    '''
    # Calculate the average for the variable
    var_avg = df[var].mean()
    
    # Calculate variable mean per group
    var_means = df.groupby(group)[var].mean()
    
    # Sort by means and get the sorted order
    var_sorted = var_means.sort_values(ascending=False).index
    
    # Reorder the DataFrame by the sorted group
    df[group] = pd.Categorical(df[group], categories=var_sorted, ordered=True)
    
    # Create the boxplot with the reordered sectors
    ax = sns.boxplot(data=df, x=group, y=var, order=var_sorted, showfliers=outliers)
    
    # Add horizontal line for average variable value
    plt.axhline(var_avg, color='red', linestyle='--', label=f'Avg {var}: {var_avg:.2f}')
    
    # Scatter plot for means
    x_positions = range(len(var_means.sort_values(ascending=False)))
    plt.scatter(x=x_positions, y=var_means.sort_values(ascending=False), color='red', label='Mean', zorder=5)
    
    # Add labels to each red dot with the mean value
    for i, mean in enumerate(var_means.sort_values(ascending=False)):
        plt.text(i, mean, f'{mean:.2f}', color='red', ha='center', va='bottom')
    
    # Rotate x-axis labels
    plt.xticks(ticks=x_positions, labels=var_means.sort_values(ascending=False).index, rotation=90)
    
    # Add a legend
    plt.legend()
    plt.xlabel('')  # Remove x-axis title
    
    # Add plot title with DataFrame name
    plt.title(f'Boxplot of {var} by {group} - {df_name}')
    
    # Adjust layout
    plt.tight_layout()
    
    # Display the plot
    #plt.show()


**Functions:**
- `univariate_numerical(data)`: generates two plots for each numerical variable: a histplot for the distribution and a boxplot for the statistical summary.
- `univariate_categorical(data)`: generates a countplot for each categorical variable, labeled with count and percentage.
- `plot_crosstab_bar_count(df, var_interest)`: creates a barplot of the crosstab of the variable of interest against each of the other categorical variables, labeled with counts.
- `plot_crosstab_heat_perc(df, var_interest, df_name)`: creates a heatmap of the crosstab of the variable of interest against each of the other categorical variables, labeled with counts, percentage by row and percentage by column.
- `boxplot_by_group(df, group, var, outliers, df_name)`: draws a boxplot of a numerical variable of interest against a categorical variable, with or without outliers, including the overall mean and the mean by category.

## Univariate Analysis

In [None]:
univariate_numerical(df)

In [None]:
univariate_categorical(df);

In [None]:
df.loc[(df['real_state']=="flat")].describe().T

In [None]:
df.loc[(df['neighborhood']=="Eixample")].describe().T

## Bivariate Analysis

In [None]:
# Create a PairGrid
g = sns.PairGrid(df, corner=True)

# Map different plots to the grid
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot,kde=True)

# Show the plot
plt.show()

In [479]:
# Calculate correlation matrix
corr_matrix = df.select_dtypes(include=np.number).corr()

In [None]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()

In [481]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
#sorted_corr # Display the sorted correlation table
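
Dropping duplicates on the `correlation` value works when every pair has a distinct coefficient, but two different pairs that happen to share the same value would be collapsed into one row. As an alternative sketch (not the approach used above), the upper triangle of the matrix can be read directly so that each pair appears exactly once. The helper name `unique_corr_pairs` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def unique_corr_pairs(frame):
    """Return one row per unordered pair of numeric columns, sorted by |correlation|."""
    corr = frame.select_dtypes(include=np.number).corr()
    # Indices of the strict upper triangle: each pair once, no self-correlations
    rows, cols = np.triu_indices(len(corr.columns), k=1)
    pairs = pd.DataFrame({
        "variable1": corr.columns[rows],
        "variable2": corr.columns[cols],
        "correlation": corr.to_numpy()[rows, cols],
    })
    return pairs.sort_values(by="correlation", key=abs, ascending=False, ignore_index=True)

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with a
    "c": [5, 3, 4, 1, 2],
})
pairs = unique_corr_pairs(toy)
print(pairs)
```

In this toy frame the pairs `a`-`c` and `b`-`c` have the same coefficient up to rounding, which is the situation where deduplicating on the value could lose a pair.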

In [482]:
# Define a function to categorize the correlation level
def categorize_correlation(correlation):
    abs_corr = abs(correlation) * 100  # Convert to percentage for easier comparison
    if abs_corr < 30:
        return 'Negligible'
    elif 30 <= abs_corr < 50:
        return 'Low'
    elif 50 <= abs_corr < 70:
        return 'Moderate'
    elif 70 <= abs_corr < 90:
        return 'High'
    else:
        return 'Very High'
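
The same thresholds can also be expressed with `pd.cut`, which bins a whole column at once instead of calling a Python function per row. This is a sketch of an equivalent, not a replacement for the function above; the name `corr_levels` is hypothetical.

```python
import numpy as np
import pandas as pd

def corr_levels(correlations):
    """Label each correlation by its absolute value using the thresholds 0.3, 0.5, 0.7 and 0.9."""
    return pd.cut(
        correlations.abs(),
        bins=[0, 0.3, 0.5, 0.7, 0.9, np.inf],  # left-closed bins: [0, 0.3), [0.3, 0.5), ...
        labels=["Negligible", "Low", "Moderate", "High", "Very High"],
        right=False,
    )

sample = pd.Series([0.75, -0.95, 0.10, -0.42, 1.0])
levels = corr_levels(sample).tolist()
print(levels)
```

The last bin is left open at the top (`np.inf`) so that a correlation of exactly 1.0 is still labeled.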


In [None]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()


In [None]:
sorted_corr

In [None]:
df.columns

In [487]:
# check unique rooms-bathroom combinations
unique_combinations=df.groupby(['rooms', 'bathroom']).size().reset_index(name='count')
unique_combinations_sorted=unique_combinations.sort_values(by='count',ascending=False)

In [488]:
# Calculate cumulative sum of counts
unique_combinations_sorted['cum_sum'] = unique_combinations_sorted['count'].cumsum()

In [None]:
# Calculate the cumulative percentage
unique_combinations_sorted['perc'] = unique_combinations_sorted['count'] / unique_combinations_sorted['count'].sum() * 100
unique_combinations_sorted['cum_perc'] = unique_combinations_sorted['cum_sum'] / unique_combinations_sorted['count'].sum() * 100
unique_combinations_sorted.head(10)
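
The counting, percentage and cumulative percentage steps above can be wrapped into a single helper, sketched here on made-up rows. The function name `combination_coverage` is an assumption for illustration.

```python
import pandas as pd

def combination_coverage(frame, keys):
    """Count each combination of `keys` and add its share and cumulative share (in %)."""
    counts = (
        frame.groupby(keys).size()
        .reset_index(name="count")
        .sort_values(by="count", ascending=False, ignore_index=True)
    )
    counts["perc"] = counts["count"] / counts["count"].sum() * 100
    counts["cum_perc"] = counts["perc"].cumsum()
    return counts

# Toy listings: four 2-room/1-bath units, three 1/1, two 3/2 and one 4/2
toy = pd.DataFrame({
    "rooms":    [2, 2, 2, 2, 1, 1, 1, 3, 3, 4],
    "bathroom": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
})
coverage = combination_coverage(toy, ["rooms", "bathroom"])
print(coverage)
```

Reading `cum_perc` from the top shows how many configurations are needed to cover a given share of the listings.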

In [None]:
df_pop = df.merge(unique_combinations_sorted.head(10), on=['rooms', 'bathroom'])
df_pop

In [None]:
print(df.shape,df_pop.shape)

In [531]:
# Define the function to create and display side-by-side boxplots
def side_by_side_boxplot(df1, df2, group, var, outliers, title1, title2):
    fig, axes = plt.subplots(1, 2, figsize=(18, 6), sharey=True)
    
    # First subplot for df1
    plt.sca(axes[0])
    boxplot_by_group(df1, group, var, outliers, title1)
    
    # Second subplot for df2
    plt.sca(axes[1])
    boxplot_by_group(df2, group, var, outliers, title2)
    
    # Show both plots after setup
    plt.show()

# Usage
#side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', True, "All units (show outliers)", "Popular units (show outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'price', False, "All units (without outliers)", "Popular units (without outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'square_meters_price', False, "All units (without outliers)", "Popular units (without outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'neighborhood', 'square_meters', False, "All units (without outliers)", "Popular units (without outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'price', False, "All units (without outliers)", "Popular units (without outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'square_meters_price', False, "All units (without outliers)", "Popular units (without outliers)")

In [None]:
side_by_side_boxplot(df, df_pop, 'real_state' , 'square_meters', False, "All units (without outliers)", "Popular units (without outliers)")

In [None]:
plot_crosstab_heat_perc(df, var_interest='real_state',df_name="All units")

In [None]:
plot_crosstab_heat_perc(df_pop, var_interest='real_state',df_name="Popular units")

In [None]:
plot_crosstab_bar_count(df, var_interest='lift')

In [None]:
plot_crosstab_bar_count(df, var_interest='terrace')

# Notes on Exploratory Data Analysis



**Univariate Analysis**
- The variable "Unnamed: 0" has a uniform distribution, as expected for a row index.
- The numerical variables are right-skewed.
- The categorical variables are not balanced: 79% of the properties are flats, and 78% of the units are concentrated in 50% of the neighborhoods in the sample.
- 75% of the flats have up to 3 bedrooms and up to 2 bathrooms, with an average size of 85 m2.
- 75% of the units in Eixample have up to 3 bedrooms and up to 2 bathrooms, with an average size of 87 m2.


**Bivariate Analysis**
- 'square_meters' has a positive correlation with 'price', 'rooms' and 'bathroom'.
- 'square_meters_price' has a negative correlation with 'square_meters', 'rooms' and 'bathroom'.
- Only one pair of variables has a high correlation: bathroom-square_meters (0.75).
- The most popular unit configuration in the dataset is 2 bedrooms and 1 bathroom, with 1836 units (21% of all units).
- Other popular configurations (bedrooms-bathrooms) are 1-1 (18%), 3-2 (14%), 3-1 (13%), 2-2 (9%) and 4-2 (8%).
- These six most popular unit configurations represent 86% of all units.
- The "df_pop" data frame holds the units with the ten most popular bedroom/bathroom configurations, representing 94% of the samples.
- Sarrià-Sant Gervasi, Les Corts, Eixample and Sant Martí are the most expensive neighborhoods, with average prices above the dataset average.
- Sants-Montjuïc, Horta-Guinardó, Sant Andreu and Nou Barris are the cheapest neighborhoods, with average prices below the dataset average.
- By price per square meter, Ciutat Vella and Eixample are the most expensive neighborhoods.
- By surface area, Ciutat Vella has the second-lowest average and Eixample the third-lowest.
- From the perspective of price per square meter, the most attractive neighborhood according to this data could be Les Corts, with an average surface area of 89.79 m2, above the average (78.67 m2), and a price per square meter of 15.85, below the average (17.79).
- From the perspective of price per square meter, the most attractive type of unit according to this data could be the apartment, with an average surface area of 80 m2, above the average (78.67 m2), and a price per square meter of 15.76, below the average (17.79).
- There are 1,777 flats in Eixample, the most popular combination of unit type and neighborhood: 79.68% of the units in Eixample are flats, and 28.9% of all flats are in Eixample.
- In Les Corts there are only 398 flats, far from the most popular combination: although 87.67% of the dwellings in Les Corts are flats, only 6.47% of all flats are in Les Corts.
- Most unit types have a lift; for flats the proportion is 74.12%.
- Units with a terrace, on the other hand, are rare.
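
Several of the percentages above come in pairs that are easy to confuse: the share of flats within a neighborhood versus the share of a neighborhood within all flats. A minimal sketch on invented rows, using the `normalize` argument of `pd.crosstab`, shows the difference. The counts here are illustrative and do not reproduce the dataset figures.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighborhood": ["Eixample"] * 4 + ["Les Corts"] * 2,
    "real_state":   ["flat", "flat", "flat", "attic", "flat", "flat"],
})

# Share of each property type within a neighborhood (each row sums to 100)
by_neighborhood = pd.crosstab(toy["neighborhood"], toy["real_state"], normalize="index") * 100
# Share of each neighborhood within a property type (each column sums to 100)
by_type = pd.crosstab(toy["neighborhood"], toy["real_state"], normalize="columns") * 100

flats_in_eixample = by_neighborhood.loc["Eixample", "flat"]  # flats as a share of Eixample units
eixample_in_flats = by_type.loc["Eixample", "flat"]          # Eixample as a share of all flats
print(round(flats_in_eixample, 2), round(eixample_in_flats, 2))
```

Here 3 of the 4 Eixample units are flats (75%), while 3 of the 5 flats are in Eixample (60%).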

# Data Preprocessing

- Missing value treatment
- Feature engineering
- Outlier detection and treatment
- Any other preprocessing steps

In [745]:
df2=df.copy() # Data preprocessing on a copy of the original dataset

In [None]:
df2.isna().sum() # missing values per feature

In [None]:
df2['real_state'].value_counts(dropna=False)

In [748]:
# Make sure 'real_state' is categorical, then add 'unknown' to its categories
df2['real_state'] = df2['real_state'].astype('category').cat.add_categories("unknown")

# Replace NaN values with 'unknown'
df2['real_state'] = df2['real_state'].fillna("unknown")


In [None]:
df2.isna().sum() # missing values per feature

In [None]:
df2['real_state'].value_counts()

In [754]:
df2.drop(['Unnamed: 0'], axis=1, inplace=True)

In [756]:
# function to check for outliers
def count_outliers(df):
    outlier_count = 0
    for column in df.select_dtypes(include=np.number).columns:
        # Tukey fences: 1.5 * IQR below Q1 and above Q3
        q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
        iqr = q3 - q1
        outliers = int(((df[column] < q1 - 1.5 * iqr) | (df[column] > q3 + 1.5 * iqr)).sum())
        print(f'{column}: {outliers} outliers ({outliers/df.shape[0]*100:.2f}%)')
        outlier_count += outliers
    return outlier_count
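
As a worked example of the IQR rule used by `count_outliers`, the sketch below computes the fences for five made-up values and caps the one value that falls outside them. The helper name `iqr_bounds` and its `k` parameter are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

values = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(values)           # Q1 = 2, Q3 = 4, IQR = 2
is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)

print(lower, upper)             # -1.0 7.0
print(int(is_outlier.sum()))    # 1
print(capped.tolist())
```

Only the value 100 lies above the upper fence of 7, so it is the single outlier and is capped to 7.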

In [None]:
count_outliers(df2)

In [766]:
from scipy.stats import zscore # z-score standardization per column

# Calculate z-scores for only numeric columns without creating dummies
outlier_mask = (np.abs(df2.select_dtypes(include=np.number).apply(zscore)) < 3).all(axis=1)

# Filter the DataFrame based on the outlier mask and retain the original column structure
df3 = df2[outlier_mask]
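
`scipy.stats.zscore` standardizes each column with the population standard deviation (`ddof=0`) by default. The same filter can be sketched with plain pandas on made-up data; `keep_within_z` is a hypothetical helper name and the prices are invented.

```python
import numpy as np
import pandas as pd

def keep_within_z(frame, threshold=3.0):
    """Keep rows whose numeric values all lie within `threshold` standard deviations of the column mean."""
    numeric = frame.select_dtypes(include=np.number)
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)  # ddof=0 matches scipy's default
    return frame[(z.abs() < threshold).all(axis=1)]

# Twenty ordinary rents plus one extreme value
toy = pd.DataFrame({
    "price": [900, 1100] * 10 + [15000],
    "real_state": ["flat"] * 21,
})
filtered = keep_within_z(toy)
print(len(toy), len(filtered))
```

A row is dropped as soon as any one of its numeric columns exceeds the threshold, which is why the row count falls even when most columns look ordinary.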

In [None]:
count_outliers(df3)

In [None]:
df3.shape

In [770]:
df4=df3.copy()
for column in df4.select_dtypes(include=np.number).columns:
    # Cap values at the Tukey fences (whisker boundaries)
    q1, q3 = df4[column].quantile(0.25), df4[column].quantile(0.75)
    iqr = q3 - q1
    df4[column] = np.clip(df4[column], q1 - 1.5 * iqr, q3 + 1.5 * iqr)

In [None]:
count_outliers(df4)

In [None]:
df4.shape

In [773]:
# creating dummies
df5 = pd.get_dummies(df4, columns=['real_state','neighborhood'], drop_first=True)

In [None]:
df5.shape

In [None]:
df5.info()

In [775]:
# Convert boolean to numeric
cols = df5.select_dtypes(['bool'])
for i in cols.columns:
    df5[i] = df5[i].astype('int')

In [None]:
df5.head()

In [778]:
# Apply Min-Max Scaling
min_max_scaler = MinMaxScaler()
df5mm = pd.DataFrame(min_max_scaler.fit_transform(df5), columns=df5.columns)

In [None]:
df5mm.head()

In [808]:
data=df5mm.copy()

In [809]:
datapop = data.merge(unique_combinations_sorted.head(10), on=['rooms', 'bathroom'])


In [None]:
unique_combinations_sorted.head(10)

# Notes on Data Preprocessing


- Data was preprocessed on a copy of the original dataset named df2.
- Created a new category "unknown" in the variable 'real_state' to replace NaN values.
- Removed the variable "Unnamed: 0", which has no value for modeling.
- There are outliers in all numerical variables. df2.shape: (8188, 9)
- Applied the Z-score method, removing rows with any numerical value more than 3 standard deviations from the mean. Some variables still retain a relevant percentage of outliers. df3.shape: (7742, 9)
- Capped the remaining outliers at their respective whisker boundaries. df4.shape: (7742, 9)
- Created dummy variables for 'real_state' and 'neighborhood'. df5.shape: (7742, 20)
- Boolean variables were converted to numeric.
- Min-Max scaling was applied. Because the features have different scales, normalization ensures that no single feature dominates the learning process.
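
Min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min), which is what `MinMaxScaler` computes with its default `feature_range`. Because `price` is scaled along with the features, any error measured on the scaled target is a fraction of the price range. The sketch below uses invented prices and an invented error value to show the forward transform, its inverse, and the conversion of a scaled error back to EUR.

```python
import pandas as pd

prices = pd.Series([500.0, 1000.0, 1500.0, 2500.0])  # illustrative rents in EUR/month

p_min, p_max = prices.min(), prices.max()
scaled = (prices - p_min) / (p_max - p_min)    # min-max scaling to [0, 1]
restored = scaled * (p_max - p_min) + p_min    # inverse transform

# An error measured on the scaled target maps back to EUR through the price range
scaled_mae = 0.05                              # hypothetical value, not a notebook result
mae_eur = scaled_mae * (p_max - p_min)

print(scaled.tolist())
print(round(mae_eur, 2))
```

With a price range of 2000 EUR in this toy series, a scaled error of 0.05 corresponds to 100 EUR per month.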

# EDA (pre-modeling)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.describe().T

In [None]:
univariate_numerical(data)

In [797]:
# Calculate correlation matrix
corr_matrix = data.select_dtypes(include=np.number).corr()

In [None]:
# Plot correlation matrix as heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()

In [None]:
# Display the sorted correlation table
corr_unstacked = corr_matrix.unstack() # Unstack the correlation matrix
corr_unstacked = corr_unstacked.reset_index() # Reset the index to get 'variable1' and 'variable2' as columns
corr_unstacked.columns = ['variable1', 'variable2', 'correlation']# Rename the columns for better understanding
corr_unstacked = corr_unstacked[corr_unstacked['variable1'] != corr_unstacked['variable2']] # Remove self-correlations by filtering out rows where variable1 == variable2
corr_unstacked = corr_unstacked.drop_duplicates(subset=['correlation']) # Drop duplicates to keep only one entry per variable pair
sorted_corr = corr_unstacked.sort_values(by='correlation', key=abs, ascending=False) # Sort the DataFrame by the absolute value of correlation
sorted_corr # Display the sorted correlation table

In [None]:
# Apply the function to create the corr_lvl column
sorted_corr['corr_lvl'] = sorted_corr['correlation'].apply(categorize_correlation)
sorted_corr['corr_lvl'].value_counts()

In [None]:
sorted_corr

# Notes on EDA (pre-modeling)

- The data for modeling has shape (7742, 20) after outlier treatment.
- The data for modeling has no missing values, and all variables are numeric and scaled.
- Correlation between variables is low; only one pair of variables has a high correlation (bathroom and square_meters).
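
Pairwise correlations do not reveal a variable that is explained by several others at once. The notebook imports `variance_inflation_factor` for this purpose but does not call it. As a sketch of the same quantity using only NumPy, the variance inflation factor (VIF) of each column equals the corresponding diagonal entry of the inverse correlation matrix, an identity that holds for regressions with an intercept. The helper name `vif_from_corr` and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def vif_from_corr(frame):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "x4": rng.normal(size=200),  # unrelated to the other columns
})
# x3 is almost a linear combination of x1 and x2
toy["x3"] = toy["x1"] + toy["x2"] + 0.1 * rng.normal(size=200)

vif = vif_from_corr(toy)
print(vif.round(2))
```

A VIF near 1 indicates little shared variance, while large values (here for `x1`, `x2` and `x3`) signal multicollinearity.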

# Modeling

## Preparing data for modeling

In [None]:
# specifying the independent  and dependent variables
X = data.drop(["price"], axis=1)
Y = data["price"]

# adding a constant to the independent variables
X = sm.add_constant(X)

# splitting data in train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)

# Checking training and test sets.
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)


## Model Building

In [831]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [832]:

# Define a function to evaluate and return the model's metrics
def evaluate_model(model, x_test, y_test):
    y_pred = model.predict(x_test)
    metrics = {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, y_pred)),
        "R2 Score": r2_score(y_test, y_pred)
    }
    return metrics
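
For reference, the four metrics returned by `evaluate_model` can be reproduced by hand with NumPy. The sketch below uses four made-up target values with a single wrong prediction.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])  # only the last prediction is off, by 2

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # 2 / 4 = 0.5
mse = np.mean(errors ** 2)      # 4 / 4 = 1.0
rmse = np.sqrt(mse)             # 1.0

ss_res = np.sum(errors ** 2)                     # residual sum of squares = 4
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 5
r2 = 1 - ss_res / ss_tot                         # 1 - 4/5, about 0.2

print(mae, mse, rmse, round(r2, 4))
```

R2 compares the model against always predicting the mean of the true values, so it drops to zero or below when the model is no better than that baseline.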


In [843]:

# Initialize an empty DataFrame to store results
results_df = pd.DataFrame(columns=["Model", "MAE", "MSE", "RMSE", "R2 Score"])


In [844]:

# Dictionary of regression models to try
regression_models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Support Vector Regressor": SVR()}


In [845]:

# Loop through each model, train it, evaluate it, and store results
for model_name, model in regression_models.items():
    model.fit(x_train, y_train)
    metrics = evaluate_model(model, x_test, y_test)
    metrics["Model"] = model_name  # Add model name for reference
    results_df = pd.concat([results_df, pd.DataFrame([metrics])], ignore_index=True)


In [852]:

# Display the results DataFrame
results_df.sort_values(by="R2 Score", ascending=False)

| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007376 | 0.000563 | 0.023725 | 0.990614 |
| 3 | Decision Tree | 0.01045 | 0.001089 | 0.033001 | 0.98184 |
| 6 | Support Vector Regressor | 0.041981 | 0.003182 | 0.056408 | 0.946942 |
| 0 | Linear Regression | 0.042237 | 0.004036 | 0.063528 | 0.932704 |
| 2 | Ridge Regression | 0.042269 | 0.004047 | 0.063616 | 0.932517 |
| 5 | K-Nearest Neighbors | 0.059409 | 0.009173 | 0.095774 | 0.847047 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -2.6e-05 |


In [853]:
results_df.sort_values(by="MAE")

| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007376 | 0.000563 | 0.023725 | 0.990614 |
| 3 | Decision Tree | 0.01045 | 0.001089 | 0.033001 | 0.98184 |
| 6 | Support Vector Regressor | 0.041981 | 0.003182 | 0.056408 | 0.946942 |
| 0 | Linear Regression | 0.042237 | 0.004036 | 0.063528 | 0.932704 |
| 2 | Ridge Regression | 0.042269 | 0.004047 | 0.063616 | 0.932517 |
| 5 | K-Nearest Neighbors | 0.059409 | 0.009173 | 0.095774 | 0.847047 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -2.6e-05 |


In [854]:
results_df.sort_values(by="MSE")

| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007376 | 0.000563 | 0.023725 | 0.990614 |
| 3 | Decision Tree | 0.01045 | 0.001089 | 0.033001 | 0.98184 |
| 6 | Support Vector Regressor | 0.041981 | 0.003182 | 0.056408 | 0.946942 |
| 0 | Linear Regression | 0.042237 | 0.004036 | 0.063528 | 0.932704 |
| 2 | Ridge Regression | 0.042269 | 0.004047 | 0.063616 | 0.932517 |
| 5 | K-Nearest Neighbors | 0.059409 | 0.009173 | 0.095774 | 0.847047 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -2.6e-05 |


In [855]:
results_df.sort_values(by="RMSE")

| | Model | MAE | MSE | RMSE | R2 Score |
|---|---|---|---|---|---|
| 4 | Random Forest | 0.007376 | 0.000563 | 0.023725 | 0.990614 |
| 3 | Decision Tree | 0.01045 | 0.001089 | 0.033001 | 0.98184 |
| 6 | Support Vector Regressor | 0.041981 | 0.003182 | 0.056408 | 0.946942 |
| 0 | Linear Regression | 0.042237 | 0.004036 | 0.063528 | 0.932704 |
| 2 | Ridge Regression | 0.042269 | 0.004047 | 0.063616 | 0.932517 |
| 5 | K-Nearest Neighbors | 0.059409 | 0.009173 | 0.095774 | 0.847047 |
| 1 | Lasso Regression | 0.196899 | 0.059972 | 0.244891 | -2.6e-05 |


# Notes on Model Building


- To make statistical inferences from a linear regression model, it is important to ensure that there is no multicollinearity in the data.
- The data was split 70/30. Shape of training set: (5419, 20); shape of test set: (2323, 20).
- Conclusions on the performance of each model, based on the results above:
- Performance Metrics:
    - **MAE (Mean Absolute Error)**: Measures the average magnitude of errors in a set of predictions, without considering their direction.
    - **MSE (Mean Squared Error)**: Measures the average of the squares of the errors, giving more weight to larger errors.
    - **RMSE (Root Mean Squared Error)**: The square root of MSE, providing error in the same units as the target variable.
    - **R2 Score (Coefficient of Determination)**: Indicates how well the model's predictions approximate the real data points. A value closer to 1 indicates a better fit.
- **Random Forest**: **Best Performance**. It has the lowest MAE (0.007376), MSE (0.000563), and RMSE (0.023725), and the highest R2 Score (0.990614), indicating it is the most accurate model among the ones tested.
- **Decision Tree**: **Second Best**. It also performs very well, with low MAE (0.010450), MSE (0.001089), and RMSE (0.033001), and a high R2 Score (0.981840).
- **Support Vector Regressor (SVR)**: **Good Performance**. It has a relatively low MAE (0.041981), MSE (0.003182), and RMSE (0.056408), with a high R2 Score (0.946942).
- **Linear Regression and Ridge Regression**: **Similar Performance**. Both have MAE around 0.042, MSE around 0.004, RMSE around 0.063, and R2 Score around 0.93, indicating decent performance.
- **K-Nearest Neighbors (KNN)**: **Moderate Performance**. It has higher MAE (0.059409), MSE (0.009173), and RMSE (0.095774), with a lower R2 Score (0.847047), indicating it is less accurate than the top models.
- **Lasso Regression**: **Poor Performance**. It has the highest MAE (0.196899), MSE (0.059972), and RMSE (0.244891), with an R2 Score of essentially zero (-0.000026), meaning it does no better on this dataset than predicting the mean price.
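
An R2 this close to zero is what a constant prediction produces. One plausible explanation, not verified in this notebook, is that `Lasso()` with its default `alpha=1.0` shrinks every coefficient to zero on the scaled data, leaving only the intercept (the training mean). The sketch below, on random stand-in data rather than the housing dataset, shows that predicting the training mean yields an R2 at or slightly below zero on a held-out set.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.uniform(0, 1, size=1000)     # stand-in for a min-max scaled target
y_train, y_test = y[:700], y[700:]

# A model whose coefficients are all zero predicts the training mean everywhere
y_pred = np.full_like(y_test, y_train.mean())

ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)
```

The result is negative only because the training mean differs slightly from the test mean. If this explanation holds, trying a much smaller `alpha` would be the natural next step.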
