# <center> 🧽 Data Cleaning Part 2 </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading Data](#2-loading-data)
- [3. Exploratory Data Analysis](#3-exploratory-data-analysis)
- [4. Save the transformed data](#4-save-the-transformed-data)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

In this section, we shift our focus to cleaning and preparing the property listings for sale. While some cleaning steps such as handling missing values, feature extraction, and outlier treatment may overlap with those applied to rental listings, sale listings present unique challenges. 

To ensure the dataset is ready for analysis, we will:

- Identify and handle `POA` entries in the price column
- Impute missing values in key numerical columns (e.g., erf size, floor area)
- Standardize categorical values and fix any structural inconsistencies
- Remove or adjust outliers where appropriate

By the end of this section, the for-sale listings will be structured and cleaned similarly to the rental data, ready for unified or comparative analysis across different listing types.

## 2. Loading Data
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [1]:
# Import packages
import re
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# Load the dataset that contains the listings_for_sale
listings_for_sale = pd.read_csv("../data/raw/listings_for_sale_in_capital_cities.csv")
display("Listings for sale", listings_for_sale.head(), listings_for_sale.shape)

'Listings for sale'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796


(40006, 6)

## 3. Exploratory Data Analysis
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Exploring the data ⚡ |
| :--------------------------- |

---

In [3]:
# Get information about the listings_for_sale DataFrame
listings_for_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40006 entries, 0 to 40005
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 40002 non-null  object
 1   location              33028 non-null  object
 2   property_title        40002 non-null  object
 3   property_description  40001 non-null  object
 4   property_features     40002 non-null  object
 5   listing_slug          40006 non-null  object
dtypes: object(6)
memory usage: 1.8+ MB


In [4]:
# See the number of null values in each column
listings_for_sale.isnull().sum()

price                      4
location                6978
property_title             4
property_description       5
property_features          4
listing_slug               0
dtype: int64

In [5]:
# View rows where the property_title is null
listings_for_sale[listings_for_sale['property_title'].isna()]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
14833,,,,,,property-for-sale-in-johannesburg-c100?Page=180
15027,,,,,,property-for-sale-in-johannesburg-c100?Page=189
17224,,,,,,property-for-sale-in-johannesburg-c100?Page=288
24804,,,,,,property-for-sale-in-johannesburg-c100?Page=630


In [6]:
# Drop listings where the property_title is null
listings_for_sale = listings_for_sale.dropna(subset=['property_title']).reset_index(drop=True)

In [7]:
# Get the total number of duplicate rows
listings_for_sale_duplicates = listings_for_sale[listings_for_sale.duplicated()]
num_listings_for_sale_duplicates = len(listings_for_sale_duplicates)
print(f"There are {num_listings_for_sale_duplicates} duplicates in the listings_for_sale dataset ")

There are 2072 duplicates in the listings_for_sale dataset 


In [8]:
# Drop duplicate rows and keep the first occurrence
listings_for_sale = listings_for_sale.drop_duplicates().reset_index(drop=True)

### Property description

The `property_description` column contains free-text descriptions written by agents or property owners to highlight features and benefits of the listing. While this unstructured text can provide valuable insights, it is often subjective, inconsistent, and difficult to analyse at scale without advanced natural language processing. Since our current analysis focuses on structured, comparable features across listings, we drop the `property_description` column.

In [9]:
# Drop the `property_description` column
listings_for_sale.drop(columns=['property_description'], inplace=True)

### Property features extraction

The `property_features` column has a section labelled *Points of Interest* which highlights nearby amenities and infrastructure such as schools, hospitals, shopping centres, public transport stops and recreational facilities. It provides a quick view of the property's surroundings and helps to assess its suitability for lifestyle preferences, daily commuting, and community accessibility. 

We remove this section to reduce noise and ensure consistency, as it contains unstructured text, and it may not be directly useful for aggregation without additional geospatial processing. 

In [10]:
# Remove any information following the text 'Points of Interest' in the 'property_features' column
listings_for_sale["property_features"] = listings_for_sale["property_features"].str.replace(
    r'Points of Interest.*', '', regex=True
)

# Strip leading and trailing whitespace for easier analysis
listings_for_sale["property_features"] = listings_for_sale["property_features"].str.strip()
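The truncation step can be checked on a few toy strings before running it on the full column. The sketch below uses made-up feature text (not rows from the dataset) and assumes the scraped text might contain line breaks, in which case `.` stops at the first newline unless `re.DOTALL` is passed.

```python
import re

import pandas as pd

# Made-up stand-ins for the scraped `property_features` text
sample_features = pd.Series([
    "Listing Number111Type of PropertyHouse  Points of InterestSchools: 2",
    "Listing Number222Type of PropertyFarm",
    "Listing Number333Points of InterestShops\nTransport",
])

# Drop everything from 'Points of Interest' onwards (across newlines too),
# then strip leading and trailing whitespace
cleaned_features = (
    sample_features
    .str.replace(r"Points of Interest.*", "", regex=True, flags=re.DOTALL)
    .str.strip()
)

print(cleaned_features.tolist())
```

Rows without a *Points of Interest* section pass through unchanged, so the step is safe to apply to every listing.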

The `property_features` column holds a wealth of information about each listing, including the listing number, property type, room counts, deposit information and occupation date. For this analysis, we extract only the subset of features that are structured enough to quantify and compare reliably across properties. Once these have been extracted, we drop the `property_features` column.

In [11]:
# We define the patterns of the features we want to extract
patterns = {
    'listing_number': r'Listing Number(\d+)',
    'property_type': r'Type of Property\s*(House|Apartment / Flat|Townhouse|Commercial|Industrial|Vacant Land|Farm)',
    'furnished': r'Furnished(Yes|No)',
    'pets_allowed': r'Pets Allowed(Yes|No)',
    'backup_water': r'Backup Water(Yes|No)',
    'bedrooms': r'Bedrooms?\s*(\d+)',
    'bathrooms': r'Bathrooms?\s*(\d+)',
    'kitchen': r'Kitchens?\s*(\d+)',
    'lounge': r'Lounges?\s*(\d+)',
    'dining_room': r'Dining Rooms?\s*(\d+)',
    'parking': r'Parking(\d+)',
    'garden': r'Gardens?\s*(\d+)',
    'pool': r'Pools?\s*(\d+)',
    'garage': r'Garages?\s*(\d+)',
    'erf_size (m²)': r'Erf Size\s*(\d+)\s*m²',
    'floor_area (m²)': r'Floor Area\s*(\d+)\s*m²'
}


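# Function to extract features from text
def extract_features(text):
    """Return a dict of field -> first captured value, or None when the pattern is absent."""
    features = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        # An empty capture is treated as missing
        features[field] = (match.group(1).strip() or None) if match else None
    return features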


# Apply extraction row-wise
extracted_df = listings_for_sale['property_features'].apply(
    extract_features).apply(pd.Series)

# Combine the original df with the extracted columns
results_df = pd.concat([listings_for_sale, extracted_df], axis=1)

# Drop the property_features after the extraction is complete
results_df.drop(columns=['property_features'], inplace=True)

# View results
results_df.head()

Unnamed: 0,price,location,property_title,listing_slug,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,dining_room,parking,garden,pool,garage,erf_size (m²),floor_area (m²)
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,3-bedroom-house-for-sale-in-bhisho-central-115...,115871179,House,,Yes,,3.0,1,1,1.0,1.0,3.0,,,,248.0,128.0
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...,115765257,House,,Yes,,,1,2,,,2.0,,,,280.0,278.0
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...,115864665,House,,,,3.0,1,1,1.0,,,,,,,
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,3-bedroom-house-for-sale-in-bhisho-central-115...,115811196,House,,,,3.0,1,1,,,,1.0,,,413.0,
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,3-bedroom-house-for-sale-in-tyu-tyu-115837796,115837796,House,,,,3.0,2,1,1.0,,,,,,,
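As an alternative to the row-wise `apply`, the same idea can be sketched with pandas' vectorised `str.extract`, one pattern per output column. The feature strings and the reduced `sample_patterns` dictionary below are made up for illustration; the patterns follow the shape of those defined above.

```python
import pandas as pd

# Made-up feature strings in the same concatenated style as the scraped data
sample_features = pd.Series([
    "Listing Number111Type of PropertyHouseBedrooms3Bathrooms1Erf Size248 m²",
    "Listing Number222Type of PropertyVacant Land",
])

# A reduced, illustrative subset of the patterns
sample_patterns = {
    "listing_number": r"Listing Number(\d+)",
    "bedrooms": r"Bedrooms?\s*(\d+)",
    "erf_size (m²)": r"Erf Size\s*(\d+)\s*m²",
}

# One vectorised extraction per field; rows with no match become NaN
extracted = pd.DataFrame({
    field: sample_features.str.extract(pattern, expand=False)
    for field, pattern in sample_patterns.items()
})

print(extracted)
```

Either approach leaves the extracted values as strings, so numeric columns still need `pd.to_numeric` afterwards.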


In [12]:
# Print the number of null values in the `bedrooms` column
results_df["bedrooms"].isnull().sum()

np.int64(17784)

The `property_title` column often contains structured phrases that include the number of bedrooms for residential properties such as apartments, houses and townhouses. This consistency allows for the extraction of bedroom information directly from the title using regular expressions, which can help fill in missing values in the `bedrooms` column.

In [13]:
def extract_bedrooms_from_property_title(title):
    """
    Extract the number of bedrooms from a property title string.
    This function searches for patterns like '2-bedroom', '3 bedroom', or '4bedroom' (case-insensitive) within the input title and returns the number of bedrooms as an integer.
    
    Args:
        title (str): The property title text to extract the bedroom count from

    Returns:
        int or None: The number of bedrooms if found, otherwise, None
    """
    if pd.isna(title):
        return None
    match = re.search(r'(\d+)[-\s]?bedroom', title.lower())
    return int(match.group(1)) if match else None


# Fill null bedroom values with the count parsed from the property title
results_df['bedrooms'] = results_df['bedrooms'].fillna(
    results_df['property_title'].apply(extract_bedrooms_from_property_title))


# Convert the column to an integer type
results_df['bedrooms'] = pd.to_numeric(results_df['bedrooms'], errors='coerce').astype('Int64')

In [14]:
# Print the number of null values in the `bedrooms` column
results_df["bedrooms"].isnull().sum()

np.int64(6321)
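The fill logic can be illustrated on a small made-up frame. This sketch assumes the same title pattern as above and uses `str.extract` with `re.IGNORECASE` instead of lowercasing the title first; a value already present in `bedrooms` is never overwritten.

```python
import re

import pandas as pd

# Made-up rows: two with a missing bedroom count, one already populated
toy = pd.DataFrame({
    "property_title": [
        "3 Bedroom House for Sale in Bhisho Central",
        "Vacant Land for Sale in Tyu Tyu",
        "2-bedroom Apartment for Sale in Sea Point",
    ],
    "bedrooms": [None, None, "4"],
})

# Bedroom count parsed from the title (NaN when the title has none)
title_bedrooms = toy["property_title"].str.extract(
    r"(\d+)[-\s]?bedroom", flags=re.IGNORECASE, expand=False
)

# Only the null entries are filled; existing values take precedence
toy["bedrooms"] = pd.to_numeric(
    toy["bedrooms"].fillna(title_bedrooms), errors="coerce"
).astype("Int64")

print(toy)
```

Titles without a bedroom count, such as vacant land, stay null, which is why some nulls remain after this step.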

### Suburb, city and province extraction

The `property_title` column follows a structure that includes the suburb where the property is located. We can use this to extract the suburb for each listing. 

In [15]:
# Extract the suburb from the property_title column
results_df['suburb'] = results_df['property_title'].str.extract(
    r'for Sale in (.+)', expand=False)

# Clean up possible trailing whitespaces
results_df['suburb'] = results_df['suburb'].str.strip()


# Drop the property_title column as we have extracted all the information we need
results_df.drop(columns=['property_title'], inplace=True)

# Print the first five rows of the table to see the results
results_df.head()

Unnamed: 0,price,location,listing_slug,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,dining_room,parking,garden,pool,garage,erf_size (m²),floor_area (m²),suburb
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3-bedroom-house-for-sale-in-bhisho-central-115...,115871179,House,,Yes,,3,1,1,1.0,1.0,3.0,,,,248.0,128.0,Bhisho Central
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3-bedroom-house-for-sale-in-tyu-tyu-north-1157...,115765257,House,,Yes,,3,1,2,,,2.0,,,,280.0,278.0,Tyu Tyu North
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3-bedroom-house-for-sale-in-tyu-tyu-north-1158...,115864665,House,,,,3,1,1,1.0,,,,,,,,Tyu Tyu North
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3-bedroom-house-for-sale-in-bhisho-central-115...,115811196,House,,,,3,1,1,,,,1.0,,,413.0,,Bhisho Central
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3-bedroom-house-for-sale-in-tyu-tyu-115837796,115837796,House,,,,3,2,1,1.0,,,,,,,,Tyu Tyu


We will then load a separate **city_province** CSV file that contains a mapping of cities to their matching provinces to validate and fill in missing province information in the main dataset.

In [16]:
# Load the CSV that maps each capital city to its province
city_province = pd.read_csv("../data/raw/south_africa_provinces_capitals.csv")

# Build the list of unique, whitespace-stripped city names to match against
city_list = city_province['city'].dropna().str.strip().unique()

The `location` column is formatted inconsistently: sometimes it lists the suburb and city, and other times it provides a full address. To address this, we match each location against the known city names in *city_list* and extract the first city found. If no match is found, the value is set to **None**.

In [17]:
def extract_city_from_location(location, city_list):
    """
    Extract the city name from a location string based on a provided list of cities.
    
    This function checks whether any city from a given list appears (case-insensitive) in the provided location string. If a match is found, the corresponding city name (in original casing from the list) is returned. 
    
    Args:
        location (str): The text string containing location information.
        city_list (list of str): A list of known city names to match against the location.

    Returns:
        str or None: The matched city name if found in the location string, otherwise, None
    """
    if pd.isna(location):
        return None

    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None
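One caveat of plain substring matching is that a city name can also match inside a longer word. The sketch below is a hypothetical variant, `extract_city_whole_word`, that only accepts whole-word matches. The two-city list is made up for illustration and is not the content of the capitals file.

```python
import re

import pandas as pd


def extract_city_whole_word(location, city_list):
    """Return the first city in city_list that appears as a whole word in location."""
    if pd.isna(location):
        return None
    for city in city_list:
        # \b anchors keep 'Durban' from matching inside 'Durbanville'
        if re.search(rf"\b{re.escape(city)}\b", location, flags=re.IGNORECASE):
            return city
    return None


# Hypothetical city list, ordered so the substring pitfall would show up
toy_cities = ["Durban", "Cape Town"]

print(extract_city_whole_word("Durbanville, Cape Town", toy_cities))
print(extract_city_whole_word("12 Main Rd, durban", toy_cities))
```

With substring matching and this list order, the first location would resolve to the wrong city. Whether this matters in practice depends on the names in the actual city list.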

In [18]:
# Apply the extract_city_from_location function to extract the city from the location column
results_df['city'] = results_df['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

# Drop the location column as it is redundant
results_df.drop(columns=["location"], inplace=True)

The `city` column contains null values because some entries in the `location` column are empty or incomplete. To address this, we create a mapping of known suburbs to their matching cities and use it to fill in the missing `city` values.

In [19]:
# Create a mapping of suburb to known cities
suburb_to_city_map = results_df[results_df['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
results_df['city'] = results_df.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)
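The same fill can be sketched without a row-wise `apply` by mapping suburbs to cities and passing the result to `fillna`. The frame below is made up. The extra check for suburbs linked to more than one city is an assumption about what might be worth inspecting, since a dictionary keeps only one city per suburb.

```python
import pandas as pd

# Made-up rows: one suburb with a known city and one without any
toy = pd.DataFrame({
    "suburb": ["Tyu Tyu", "Tyu Tyu", "Sea Point", "Nowhere"],
    "city": ["Bhisho", None, "Cape Town", None],
})

known = toy.dropna(subset=["city"])

# Suburbs that appear with more than one city would be ambiguous
cities_per_suburb = known.groupby("suburb")["city"].nunique()
ambiguous_suburbs = cities_per_suburb[cities_per_suburb > 1].index.tolist()

# One city per suburb, then fill only the missing city values
suburb_to_city = known.drop_duplicates(subset="suburb").set_index("suburb")["city"]
toy["city"] = toy["city"].fillna(toy["suburb"].map(suburb_to_city))

print(toy)
print("Ambiguous suburbs:", ambiguous_suburbs)
```

Suburbs that never appear with a city stay null, which matches the small number of nulls left over after this step.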

In [20]:
# Print out the number of null values within the `city` column
results_df['city'].isnull().sum()

np.int64(59)

There are 59 null rows in the `city` column, which account for roughly 0.16% of the dataset. We remove these rows to maintain data quality given their small proportion.

In [21]:
# Drop the city rows with null values
results_df = results_df.dropna(subset=['city'])

# Get the total number of null rows in the `city` column
results_df['city'].isnull().sum()

np.int64(0)

We will then merge the dataset with the **city_province** table to assign the matching province to each city.

In [22]:
# Merge the city with its matching province
results_df = results_df.merge(city_province, on='city', how='left')
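A left merge silently duplicates rows if the lookup table lists a city twice, and it leaves nulls for cities it does not contain. A minimal sketch of guarding against both, using the `validate` and `indicator` parameters of `DataFrame.merge` on made-up frames:

```python
import pandas as pd

# Made-up listings and lookup table; 'Atlantis' is deliberately absent from the lookup
toy_listings = pd.DataFrame({"city": ["Bhisho", "Cape Town", "Atlantis"]})
toy_city_province = pd.DataFrame({
    "city": ["Bhisho", "Cape Town"],
    "province": ["Eastern Cape", "Western Cape"],
})

# validate raises MergeError if a city appears more than once in the lookup;
# indicator adds a `_merge` column showing where each row came from
merged = toy_listings.merge(
    toy_city_province,
    on="city",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Cities that found no province
unmatched = merged.loc[merged["_merge"] == "left_only", "city"].tolist()
print(unmatched)
```

The `_merge` column can be dropped once the check has been made.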

The `listing_slug` column contains a URL-friendly string summarising key details about each listing, such as the number of bedrooms, property type, suburb, and a unique listing ID. Because this information is already available in other columns, we drop `listing_slug` to avoid redundancy.

In [23]:
# Drop the listing slug column
results_df.drop(columns=['listing_slug'], inplace=True)

### Price column

The `price` column currently combines the price amount with the bond calculator text (e.g., `R 690 000Bond CostsBond Calculator...`), which complicates aggregation and analysis. To resolve this, we first extract the amount into an intermediate `price_cleaned` column and strip its whitespace, then extract the digits into a new `price (ZAR)` column in **results_df**, removing the currency symbol in the process.

In [24]:
# Extract the price amount from the `price` column
results_df['price_cleaned'] = results_df['price'].str.extract(
    r'(R\s?\d[\d\s]*)')

# Remove all whitespace, including the spaces between digit groups
results_df['price_cleaned'] = results_df['price_cleaned'].str.replace(
    r'\s+', '', regex=True)

# Extract the results of the `price_cleaned` column to the new column `price (ZAR)`
results_df['price (ZAR)'] = results_df['price_cleaned'].str.extract(
    r'R?(\d+)', expand=False)

# Print out the first five rows to see the results
results_df.head()

Unnamed: 0,price,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,...,garden,pool,garage,erf_size (m²),floor_area (m²),suburb,city,province,price_cleaned,price (ZAR)
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,115871179,House,,Yes,,3,1,1,1.0,...,,,,248.0,128.0,Bhisho Central,Bhisho,Eastern Cape,R690000,690000
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,115765257,House,,Yes,,3,1,2,,...,,,,280.0,278.0,Tyu Tyu North,Bhisho,Eastern Cape,R700000,700000
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,115864665,House,,,,3,1,1,1.0,...,,,,,,Tyu Tyu North,Bhisho,Eastern Cape,R749000,749000
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,115811196,House,,,,3,1,1,,...,1.0,,,413.0,,Bhisho Central,Bhisho,Eastern Cape,R799000,799000
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,115837796,House,,,,3,2,1,1.0,...,,,,,,Tyu Tyu,Bhisho,Eastern Cape,R695000,695000
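The two extraction steps can also be collapsed into a single pass that goes straight to a numeric value. The sketch below runs on shortened, made-up price strings and assumes every real price starts with `R` followed by space-separated digit groups.

```python
import pandas as pd

# Shortened, made-up price strings in the same style as the raw column
toy_price = pd.Series([
    "R 690 000Bond CostsBond Calculator",
    "POABond CostsBond Calculator",
    "R 1 250 000Bond CostsBond Calculator",
])

# Capture the digit groups after the leading 'R', remove the spaces between
# them, and convert to a number; entries without a price become NaN
price_zar = pd.to_numeric(
    toy_price
    .str.extract(r"^R\s?(\d[\d\s]*)", expand=False)
    .str.replace(r"\s+", "", regex=True),
    errors="coerce",
)

print(price_zar.tolist())
```

`POA` entries come out as NaN here, which makes them easy to pick up with `isna()` in the imputation step.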


In [25]:
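# Get number of rows that contain "POA"
# Note: this counts rows in listings_for_sale, which still includes the rows
# dropped from results_df for having no city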
poa_mask = listings_for_sale['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

# Print results
print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 542 listings with 'POA' in the price column.


There are currently 542 rows containing "POA". To clean this, we will calculate the average price based on `suburb`, `city` and `property_type`, and use these averages to impute the missing values.

In [26]:
# Identify POA entries in 'price' for sale listings
poa_mask = results_df['price'].str.startswith('POA', na=False)

# Ensure 'price (ZAR)' is numeric (already extracted earlier)
results_df['price (ZAR)'] = pd.to_numeric(
    results_df['price (ZAR)'], errors='coerce')



# Calculate grouped averages for non-POA values
avg_prices = results_df[~poa_mask & results_df['price (ZAR)'].notna()] \
    .groupby(['suburb', 'city', 'property_type'])['price (ZAR)'] \
    .mean().reset_index().rename(columns={'price (ZAR)': 'price_avg'})

# Merge the average prices back into the main DataFrame
results_df = results_df.merge(
    avg_prices, on=['suburb', 'city', 'property_type'], how='left')

# Impute POA prices with the average (only where POA exists)
results_df.loc[poa_mask, 'price (ZAR)'] = results_df.loc[poa_mask, 'price_avg']

# Drop the helper column
results_df.drop(columns='price_avg', inplace=True)
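Group-average imputation can also be sketched with `groupby(...).transform("mean")`, which returns a value aligned to every row and so avoids the merge and the helper column. The frame below is made up; a group with no priced listing still yields NaN, as in the approach above.

```python
import pandas as pd

# Made-up listings: suburb A has two priced houses, suburb B has none
toy = pd.DataFrame({
    "suburb": ["A", "A", "A", "B"],
    "city": ["X", "X", "X", "X"],
    "property_type": ["House", "House", "House", "House"],
    "price (ZAR)": [500_000.0, 700_000.0, None, None],
})

group_keys = ["suburb", "city", "property_type"]

# Mean of the known prices in each group, broadcast back to every row
group_mean = toy.groupby(group_keys)["price (ZAR)"].transform("mean")

# Only missing prices are filled; known prices are left untouched
toy["price (ZAR)"] = toy["price (ZAR)"].fillna(group_mean)

print(toy)
```

Because a few very expensive listings can pull a mean upwards, `transform("median")` is a common alternative when prices are skewed.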

In [27]:
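# Count rows whose price is exactly "POA"; raw values look like "POABond Costs...",
# so this exact match cannot detect them and the null count below is the meaningful check
print("Remaining 'POA':", results_df['price'].str.lower().eq('poa').sum())

# Get total number of null values in the `price (ZAR)` column
print("Remaining nulls 'price (ZAR)':", results_df['price (ZAR)'].isna().sum())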

Remaining 'POA': 0
Remaining nulls 'price (ZAR)': 41


In [28]:
# View the rows where `price (ZAR)` is null
results_df[results_df['price (ZAR)'].isna()]

Unnamed: 0,price,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,...,garden,pool,garage,erf_size (m²),floor_area (m²),suburb,city,province,price_cleaned,price (ZAR)
2704,POABond CostsBond CalculatorPurchase PriceRDep...,115976173,House,,,,5.0,3.0,1.0,1.0,...,1.0,,2.0,,250.0,Hartebeesfontein SH,Bloemfontein,Free State,,
3024,POABond CostsBond CalculatorPurchase PriceRDep...,111277371,Commercial,Yes,No,,,12.0,2.0,,...,,,,,,Heuwelsig,Bloemfontein,Free State,,
3558,POABond CostsBond CalculatorPurchase PriceRDep...,114108103,Commercial,,,,13.0,14.0,1.0,,...,,,,,882.0,Bayswater,Bloemfontein,Free State,,
4771,POABond CostsBond CalculatorPurchase PriceRDep...,114515238,Industrial,,,,,,,,...,,,,,,Noordhoek,Bloemfontein,Free State,,
5521,POABond CostsBond CalculatorPurchase PriceRDep...,113316750,Townhouse,,,,,,,,...,,,,,,Bloemfontein Rural,Bloemfontein,Free State,,
5621,POABond CostsBond CalculatorPurchase PriceRDep...,114505067,Vacant Land,,,,,,,,...,,,,,,Fichardt Park,Bloemfontein,Free State,,
7230,POABond CostsBond CalculatorPurchase PriceRDep...,115523243,Commercial,,,,,,,,...,,,,,,Gatesville,Cape Town,Western Cape,,
7928,POABond CostsBond CalculatorPurchase PriceRDep...,115865675,Commercial,,,,,,,,...,,,,622.0,,Kirstenhof,Cape Town,Western Cape,,
7955,POABond CostsBond CalculatorPurchase PriceRDep...,115960492,House,,Yes,,4.0,1.0,1.0,1.0,...,,,1.0,487.0,,Zwartdam,Cape Town,Western Cape,,
10449,POABond CostsBond CalculatorPurchase PriceRDep...,114215115,Vacant Land,,,,1.0,1.0,1.0,1.0,...,,1.0,1.0,,,Belle Constantia,Cape Town,Western Cape,,


These 41 rows remain because no priced listing shares their combination of `property_type`, `city` and `suburb`, so no group average could be calculated for them. Since they represent a small subset of the data, we remove them from the dataset.

In [29]:
# Drop rows where `price (ZAR)` is null
results_df = results_df.dropna(subset=['price (ZAR)'])

In [30]:
# Drop the price column as we do not need it anymore
results_df.drop(columns=['price'], inplace=True)

# Drop the price_cleaned column as we do not need it anymore
results_df.drop(columns=['price_cleaned'], inplace=True)

We will now proceed to clean the columns that were previously extracted from the `property_features` column. 

### Pets Allowed column

In [31]:
# Get the total number of null values in the `pets_allowed` column
results_df['pets_allowed'].isnull().sum()

np.int64(15540)

In [32]:
# Count how many times each value occurs in the `pets_allowed` column
results_df['pets_allowed'].value_counts()

pets_allowed
Yes    15342
No      6948
Name: count, dtype: int64

We will calculate the mode of the `pets_allowed` column for each property type to identify the most common pet policy associated with each major property category. 

In [33]:
# Calculate mode for each property type
pets_mode = results_df.groupby('property_type')['pets_allowed'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)

# Print results
print(pets_mode)

property_type
Apartment / Flat     No
Commercial           No
Farm                Yes
House               Yes
Industrial           No
Townhouse           Yes
Vacant Land          No
Name: pets_allowed, dtype: object


We will then use the mode values for each property type to fill in the missing values in the `pets_allowed` column. 

In [34]:
# Initialise a dictionary that contains the common mode values for each property type
default_pets_value = {
    'Apartment / Flat': 'No',
    'Commercial': 'No',
    'Farm': 'Yes',
    'House': 'Yes',
    'Industrial': 'No',
    'Townhouse': 'Yes',
    'Vacant Land': 'No'
}

# Apply the default_pets_value for all null values
results_df['pets_allowed'] = results_df.apply(
    lambda row: default_pets_value[row['property_type']]
    if pd.isna(row['pets_allowed']) and row['property_type'] in default_pets_value
    else row['pets_allowed'],
    axis=1
)
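Rather than copying the modes into a hand-written dictionary, the fill values can be derived from the grouped result directly, so they stay in sync if the data changes. A minimal sketch on made-up rows, where `first_mode` is a hypothetical helper mirroring the lambda used above:

```python
import pandas as pd

# Made-up rows; 'Industrial' has no observed pet policy at all
toy = pd.DataFrame({
    "property_type": [
        "House", "House", "House",
        "Apartment / Flat", "Apartment / Flat",
        "Industrial",
    ],
    "pets_allowed": ["Yes", "Yes", None, "No", None, None],
})


def first_mode(values):
    """Return the most common value, or pd.NA when the group has no values."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else pd.NA


# Most common policy per property type
mode_by_type = toy.groupby("property_type")["pets_allowed"].agg(first_mode)

# Fill only the missing entries with the mode for that row's property type
toy["pets_allowed"] = toy["pets_allowed"].fillna(
    toy["property_type"].map(mode_by_type)
)

print(toy)
```

A property type with no observed values stays null, so a final null check is still worthwhile.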

In [35]:
# Verify that no null values exist in the `pets_allowed` column
print(results_df['pets_allowed'].isna().sum())

0


### Furnished column

In [36]:
# Get the number of null values in the `furnished` column
results_df['furnished'].isnull().sum()

np.int64(36800)

In [37]:
# Count how many times each value occurs in the `furnished` column
results_df['furnished'].value_counts()

furnished
Yes    1030
Name: count, dtype: int64

We will calculate the mode of the `furnished` column for each property type to identify the most common furnishing status associated with each major property category.

In [38]:
# Calculate mode for each property type for the furnished column
furnished_mode = results_df.groupby('property_type')['furnished'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)
print(furnished_mode)

property_type
Apartment / Flat     Yes
Commercial           Yes
Farm                 Yes
House                Yes
Industrial          <NA>
Townhouse            Yes
Vacant Land          Yes
Name: furnished, dtype: object


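Because `Yes` is the only value observed in this column, mode imputation would label every listing as furnished, which the data does not support. To maintain data consistency, we instead fill the null values in the `furnished` column with *"Unknown"*.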

In [39]:
# Fill the null values in the furnished column with "Unknown"
results_df['furnished'] = results_df['furnished'].fillna("Unknown")

In [40]:
# Verify that there are no null values in the `furnished` column
print(results_df['furnished'].isna().sum())

0


### Bedrooms column

In [41]:
# Get the total number of null values in the `bedrooms` column
results_df['bedrooms'].isnull().sum()

np.int64(6249)

In [42]:
# Count how many times each value occurs in the `bedrooms` column
results_df['bedrooms'].value_counts()

bedrooms
3     11223
2      7751
4      5399
1      2717
5      2260
      ...  
78        1
37        1
65        1
64        1
34        1
Name: count, Length: 64, dtype: Int64

For the `bedrooms` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [43]:
# Ensure that the bedrooms column is numeric
results_df['bedrooms'] = pd.to_numeric(
    results_df['bedrooms'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_bedrooms = results_df.groupby('property_type')[
    'bedrooms'].median().reset_index()

# Rename columns for clarity
median_bedrooms.columns = ['property_type', 'median_bedrooms']

# Sort by median descending (optional)
median_bedrooms = median_bedrooms.sort_values(
    by='median_bedrooms', ascending=False)

print(median_bedrooms)

      property_type  median_bedrooms
1        Commercial             12.5
2              Farm              4.0
3             House              3.0
6       Vacant Land              3.0
5         Townhouse              3.0
0  Apartment / Flat              2.0
4        Industrial              2.0


There appear to be outliers in the `bedrooms` column, with a maximum value of 244. To investigate further, we count the outlier entries per property type, using 15 bedrooms as the threshold.

In [44]:
# Ensure 'bedrooms' is numeric
results_df['bedrooms'] = pd.to_numeric(
    results_df['bedrooms'], errors='coerce').astype('Int64')

# Filter listings with bedrooms > 15
bedroom_outliers = results_df[results_df['bedrooms'] > 15]

# Group by property_type and count
bedroom_summary = bedroom_outliers.groupby(
    'property_type')['bedrooms'].count().reset_index()

# Rename column for clarity
bedroom_summary.columns = ['property_type', 'Count with >15 bedrooms']

# Sort by count descending
bedroom_summary = bedroom_summary.sort_values(
    by='Count with >15 bedrooms', ascending=False)

print(bedroom_summary)

      property_type  Count with >15 bedrooms
3             House                      101
1        Commercial                       69
0  Apartment / Flat                       25
2              Farm                       15
4         Townhouse                        3


In total, there are 213 outlier entries in the `bedrooms` column, with the highest count found in the house property type. To address those, we replace the outlier values with the median number of bedrooms for each property type. If a median value cannot be determined (i.e., is missing), we substitute it with zero. 

For the remaining null values, we impute them using the median number of bedrooms for each property type.

In [45]:
# Set the bedroom_threshold for outlier bedroom numbers
bedroom_threshold = 15

# Identify outliers
bedroom_outliers = results_df[results_df['bedrooms'] > bedroom_threshold]

# Compute per-type medians from in-range rows only, and replace NaNs with 0
bedroom_median = results_df[results_df['bedrooms'] <= bedroom_threshold] \
    .groupby('property_type')['bedrooms'].median().fillna(0)

# Replace outlier values with the property-type median (0 if the type has none)
for idx, row in bedroom_outliers.iterrows():
    prop_type = row['property_type']
    median_val = bedroom_median.get(prop_type, 0)
    results_df.at[idx, 'bedrooms'] = median_val

In [46]:
# Fill NaN bedrooms values with property-type median, or 0 if still missing
def fill_missing_bedrooms(row):
    if pd.isna(row['bedrooms']):
        return bedroom_median.get(row['property_type'], 0)
    return row['bedrooms']


# Apply the above function to the `bedrooms` column
results_df['bedrooms'] = results_df.apply(fill_missing_bedrooms, axis=1)

# Convert results to integers
results_df['bedrooms'] = results_df['bedrooms'].round().astype('Int64')
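The outlier replacement and the null fill can be sketched together without `iterrows`, by masking out-of-range values and then filling every gap from the per-type median. The frame and the name `threshold` below are made up for illustration; the rule is the one described above, with 0 as the fallback.

```python
import numpy as np
import pandas as pd

# Made-up rows: one extreme value, one missing value, and a type with no data
toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "House", "House", "Farm"],
    "bedrooms": [3, 3, 4, 244, np.nan, np.nan],
})
threshold = 15

# Blank out values above the threshold so they are treated like missing ones
in_range = toy["bedrooms"].mask(toy["bedrooms"] > threshold)

# Median per property type from in-range values only; 0 where none exist
median_by_type = in_range.groupby(toy["property_type"]).median().fillna(0)

# Fill outliers and nulls in one step, then store as nullable integers
toy["bedrooms"] = (
    in_range.fillna(toy["property_type"].map(median_by_type))
    .round()
    .astype("Int64")
)

print(toy)
```

The same pattern would apply to `bathrooms` with its own threshold.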

In [47]:
# Verify there are no null values in the `bedrooms` column
results_df['bedrooms'].isnull().sum()

np.int64(0)

### Bathrooms column

In [48]:
# Get the total number of null values in the `bathrooms` column
results_df['bathrooms'].isnull().sum()

np.int64(5000)

In [49]:
# Count how many times each value occurs in the `bathrooms` column
results_df['bathrooms'].value_counts()

bathrooms
2      13078
1      11229
3       4659
4       1864
5        768
6        376
7        192
8        150
10       103
9         77
12        66
11        41
15        31
14        26
13        26
20        17
18        16
21        15
16        13
36         7
0          6
24         6
25         5
17         5
22         4
19         4
27         4
30         3
47         3
26         3
65         3
48         3
23         3
31         2
28         2
40         2
57         1
84         1
137        1
131        1
82         1
49         1
52         1
78         1
85         1
112        1
33         1
50         1
29         1
54         1
55         1
60         1
32         1
34         1
Name: count, dtype: int64

There appear to be outliers in the `bathrooms` column, with a maximum value of 137. To investigate further, we count the outlier entries per property type, using 10 bathrooms as the threshold.

In [50]:
# Ensure 'bathrooms' is numeric
results_df['bathrooms'] = pd.to_numeric(
    results_df['bathrooms'], errors='coerce').astype('Int64')

# Filter listings with bathrooms > 10
bathroom_outliers = results_df[results_df['bathrooms'] > 10]

# Group by property_type and count
bathroom_summary = bathroom_outliers.groupby(
    'property_type')['bathrooms'].count().reset_index()

# Rename column for clarity
bathroom_summary.columns = ['property_type', 'Count with >10 bathrooms']

# Sort by count descending
bathroom_summary = bathroom_summary.sort_values(
    by='Count with >10 bathrooms', ascending=False)

print(bathroom_summary)

      property_type  Count with >10 bathrooms
1        Commercial                       130
3             House                       127
2              Farm                        26
0  Apartment / Flat                        25
4        Industrial                        16
5         Townhouse                         4


In total, there are 328 outlier entries in the `bathrooms` column, with the highest count found in the Commercial property type. To address these, we replace each outlier with the median number of bathrooms for its property type. If a median cannot be determined (i.e., it is missing), we substitute zero.

We then impute the remaining null values with the same per-property-type median.
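
The two steps above can also be written without an explicit loop. The following is a minimal sketch on made-up data, not the code used in this notebook: the helper name `replace_outliers_and_impute` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def replace_outliers_and_impute(df, column, threshold, group_col="property_type"):
    """Replace values above `threshold` and nulls with the per-group median.

    Hypothetical helper for illustration; falls back to 0 when a group
    has no valid values from which to compute a median.
    """
    values = pd.to_numeric(df[column], errors="coerce")
    # Mask out-of-range values so medians use in-range values only
    in_range = values.where(values <= threshold)
    medians = in_range.groupby(df[group_col]).median()
    # Look up each row's group median, using 0 where none exists
    row_medians = df[group_col].map(medians).fillna(0)
    cleaned = in_range.fillna(row_medians)
    return cleaned.round().astype("Int64")


toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "Farm", "Farm", "Industrial"],
    "bathrooms": [2, 137, None, 3, None, None],
})
toy["bathrooms"] = replace_outliers_and_impute(toy, "bathrooms", threshold=10)
print(toy)
```

Because outliers are masked before the medians are computed, a single `fillna` handles both the outliers and the original nulls.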

In [51]:
# Set the bathroom_threshold for outlier bathroom numbers
bathroom_threshold = 10

# Select the outlier rows, as defined by bathroom_threshold
outlier_bathrooms = results_df[results_df['bathrooms'] > bathroom_threshold]

# Compute per-type medians from non-outlier rows; replace NaN medians with 0
bathroom_median = results_df[results_df['bathrooms'] <= bathroom_threshold].groupby(
    'property_type')['bathrooms'].median()

bathroom_median = bathroom_median.fillna(0)

# Replace each outlier with the median for its property type
for idx, row in outlier_bathrooms.iterrows():
    prop_type = row['property_type']
    median_val = bathroom_median.get(prop_type, 0)
    results_df.at[idx, 'bathrooms'] = median_val

In [52]:
# Fill NaN bathroom values with property-type median, or 0 if still missing
def fill_missing_bathrooms(row):
    if pd.isna(row['bathrooms']):
        return bathroom_median.get(row['property_type'], 0)
    return row['bathrooms']


# Apply the above function to the `bathrooms` column
results_df['bathrooms'] = results_df.apply(fill_missing_bathrooms, axis=1)

# Convert results to integers
results_df['bathrooms'] = results_df['bathrooms'].round().astype('Int64')

In [53]:
# Verify there are no null values in the `bathrooms` column
results_df['bathrooms'].isnull().sum()

np.int64(0)

### Kitchen column

In [54]:
# Get the total number of null values in the `kitchen` column
results_df['kitchen'].isnull().sum()

np.int64(8978)

In [55]:
# Get the unique value counts for the `kitchen` column
results_df['kitchen'].value_counts()

kitchen
1      27115
2       1048
3        304
4        148
5         82
6         25
8         20
7         18
18        15
10        10
11         9
14         9
12         8
15         6
9          5
16         4
27         4
20         3
13         2
47         1
51         1
17         1
84         1
25         1
131        1
115        1
65         1
34         1
112        1
78         1
85         1
24         1
19         1
48         1
40         1
32         1
Name: count, dtype: int64

For the `kitchen` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [56]:
# Ensure that the kitchen column is numeric
results_df['kitchen'] = pd.to_numeric(
    results_df['kitchen'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_kitchens = results_df.groupby('property_type')[
    'kitchen'].median().reset_index()

# Rename columns for clarity
median_kitchens.columns = ['property_type', 'median_kitchen']

# Sort by median descending
median_kitchens = median_kitchens.sort_values(
    by='median_kitchen', ascending=False)

print(median_kitchens)

      property_type  median_kitchen
0  Apartment / Flat             1.0
1        Commercial             1.0
2              Farm             1.0
3             House             1.0
4        Industrial             1.0
5         Townhouse             1.0
6       Vacant Land             1.0


There are also outliers in the dataset, with the maximum number of kitchens reaching 131. To address this, we replace the outliers with the median kitchen value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the `kitchen` column using the median value for the respective property type, again defaulting to 0 if the median is unavailable.

In [57]:
# Define threshold
kitchen_threshold = 9

# Identify outliers
kitchen_outliers = results_df[results_df['kitchen'] > kitchen_threshold]

# Compute medians and replace NaNs with 0
kitchen_median = results_df[results_df['kitchen'] <= kitchen_threshold] \
    .groupby('property_type')['kitchen'].median().fillna(0)

# Replace outlier values
for idx, row in kitchen_outliers.iterrows():
    prop_type = row['property_type']
    median_val = kitchen_median.get(prop_type, 0)
    results_df.at[idx, 'kitchen'] = median_val

In [58]:
# Fill NaN kitchen values with property-type median, or 0 if still missing
def fill_missing_kitchen(row):
    if pd.isna(row['kitchen']):
        return kitchen_median.get(row['property_type'], 0)
    return row['kitchen']


# Apply the function
results_df['kitchen'] = results_df.apply(fill_missing_kitchen, axis=1)

# Convert results to integers
results_df['kitchen'] = results_df['kitchen'].round().astype('Int64')

In [59]:
# Verify that there are no null values
results_df['kitchen'].isnull().sum()

np.int64(0)

### Lounge column

In [60]:
# Get the total number of null values in the `lounge` column
results_df['lounge'].isnull().sum()

np.int64(23922)

In [61]:
# Get the unique value counts of the `lounge` column
results_df['lounge'].value_counts()

lounge
1     13150
2       575
3       122
4        31
5        15
6         4
7         4
9         3
8         2
12        1
11        1
Name: count, dtype: int64

For the `lounge` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [62]:
# Ensure that the lounge column is numeric
results_df['lounge'] = pd.to_numeric(
    results_df['lounge'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_lounge = results_df.groupby('property_type')[
    'lounge'].median().reset_index()

# Rename columns for clarity
median_lounge.columns = ['property_type', 'median_lounge']

# Sort by median descending (optional)
median_lounge = median_lounge.sort_values(
    by='median_lounge', ascending=False)

print(median_lounge)

      property_type  median_lounge
0  Apartment / Flat            1.0
1        Commercial            1.0
2              Farm            1.0
3             House            1.0
4        Industrial            1.0
5         Townhouse            1.0
6       Vacant Land            1.0


There are outliers in the dataset, with the maximum number of lounges reaching 12. To address this, we replace the outliers with the median lounge value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the lounge column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [63]:
# Define threshold
lounge_threshold = 9

# Identify outliers
lounge_outliers = results_df[results_df['lounge'] > lounge_threshold]

# Compute medians and replace NaNs with 0
lounge_median = results_df[results_df['lounge'] <= lounge_threshold] \
    .groupby('property_type')['lounge'].median().fillna(0)

# Replace outlier values
for idx, row in lounge_outliers.iterrows():
    prop_type = row['property_type']
    median_val = lounge_median.get(prop_type, 0)
    results_df.at[idx, 'lounge'] = median_val

In [64]:
# Fill NaN lounge values with property-type median, or 0 if still missing
def fill_missing_lounge(row):
    if pd.isna(row['lounge']):
        return lounge_median.get(row['property_type'], 0)
    return row['lounge']


# Apply the function
results_df['lounge'] = results_df.apply(fill_missing_lounge, axis=1)

# Convert results to integers
results_df['lounge'] = results_df['lounge'].round().astype('Int64')

In [65]:
# Verify that there are no null values present
results_df['lounge'].isnull().sum()

np.int64(0)

### Dining Room column

In [66]:
# Get the total number of null values in the `dining_room` column
results_df['dining_room'].isnull().sum()

np.int64(26440)

In [67]:
# Get the unique value counts in the `dining_room` column
results_df['dining_room'].value_counts()

dining_room
1      11118
2        228
3         27
4         10
5          3
7          2
122        1
8          1
Name: count, dtype: int64

For the `dining_room` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [68]:
# Ensure that the dining_room column is numeric
results_df['dining_room'] = pd.to_numeric(
    results_df['dining_room'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_dining_room = results_df.groupby('property_type')[
    'dining_room'].median().reset_index()

# Rename columns for clarity
median_dining_room.columns = ['property_type', 'median_dining_room']

# Sort by median descending (optional)
median_dining_room = median_dining_room.sort_values(
    by='median_dining_room', ascending=False)

print(median_dining_room)

      property_type  median_dining_room
0  Apartment / Flat                 1.0
1        Commercial                 1.0
2              Farm                 1.0
3             House                 1.0
5         Townhouse                 1.0
6       Vacant Land                 1.0
4        Industrial                <NA>


The median dining room value for the Industrial property type is null, which is expected given the nature of these properties. There are also outliers in the dataset, with the maximum number of dining rooms reaching 122. To address this, we replace the outliers with the median dining room value for each property type, defaulting to zero if the median is unavailable.

For the remaining null values, we impute the dining room column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.
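
The fallback to zero relies on `Series.get`, which returns a default when a key is absent from the index. A minimal sketch with made-up medians (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative per-type medians; "Industrial" is deliberately absent,
# as can happen when a property type has no non-null values in a column
toy_medians = pd.Series({"House": 1.0, "Farm": 1.0})

house_fill = toy_medians.get("House", 0)            # key present
industrial_fill = toy_medians.get("Industrial", 0)  # key missing, default used

print(house_fill, industrial_fill)
```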

In [69]:
# Define threshold
dining_room_threshold = 9

# Identify outliers
dining_room_outliers = results_df[results_df['dining_room'] > dining_room_threshold]

# Compute medians and replace NaNs with 0
dining_room_median = results_df[results_df['dining_room'] <= dining_room_threshold] \
    .groupby('property_type')['dining_room'].median().fillna(0)

# Replace outlier values
for idx, row in dining_room_outliers.iterrows():
    prop_type = row['property_type']
    median_val = dining_room_median.get(prop_type, 0)
    results_df.at[idx, 'dining_room'] = median_val

In [70]:
# Fill NaN dining_room values with property-type median, or 0 if still missing
def fill_missing_dining_room(row):
    if pd.isna(row['dining_room']):
        return dining_room_median.get(row['property_type'], 0)
    return row['dining_room']


# Apply the function
results_df['dining_room'] = results_df.apply(fill_missing_dining_room, axis=1)

# Convert results to integers
results_df['dining_room'] = results_df['dining_room'].round().astype('Int64')

In [71]:
# Verify that no null values exist in the column
results_df['dining_room'].isnull().sum()

np.int64(0)

### Garage column

In [72]:
# Get the total number of null values in the `garage` column
results_df['garage'].isnull().sum()

np.int64(28417)

In [73]:
# Get the unique value counts for the `garage` column
results_df['garage'].value_counts()

garage
1     4210
2     4068
3      583
4      386
5       67
6       54
8        9
7        8
0        8
10       4
12       3
40       3
13       2
18       1
9        1
25       1
15       1
34       1
20       1
19       1
24       1
Name: count, dtype: int64

For the `garage` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [74]:
# Ensure that the garage column is numeric
results_df['garage'] = pd.to_numeric(
    results_df['garage'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_garage = results_df.groupby('property_type')[
    'garage'].median().reset_index()

# Rename columns for clarity
median_garage.columns = ['property_type', 'median_garage']

# Sort by median descending (optional)
median_garage = median_garage.sort_values(
    by='median_garage', ascending=False)

print(median_garage)

      property_type  median_garage
1        Commercial            2.0
2              Farm            2.0
6       Vacant Land            2.0
3             House            2.0
0  Apartment / Flat            1.0
5         Townhouse            1.0
4        Industrial           <NA>


The industrial property type has a median garage value of null, which is expected given its nature. There are also outliers in the dataset, with the maximum number of garages reaching 40.

To address this, we replace the outliers with the median garage value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the garage column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [75]:
# Define the threshold
garage_threshold = 6

# Identify outliers
garage_outliers = results_df[results_df['garage'] > garage_threshold]

# Compute medians and replace NaNs with 0
garage_median = results_df[results_df['garage'] <= garage_threshold].groupby(
    'property_type')['garage'].median().fillna(0)

# Replace outlier values
for idx, row in garage_outliers.iterrows():
    prop_type = row['property_type']
    median_val = garage_median.get(prop_type, 0)
    results_df.at[idx, 'garage'] = median_val

In [76]:
# Fill NaN garage values with property-type median, or 0 if still missing
def fill_missing_garage(row):
    if pd.isna(row['garage']):
        return garage_median.get(row['property_type'], 0)
    return row['garage']


# Apply the function
results_df['garage'] = results_df.apply(fill_missing_garage, axis=1)

# Convert the column to integer
results_df['garage'] = results_df['garage'].round().astype('Int64')

In [77]:
# Verify that there are no null values
results_df['garage'].isnull().sum()

np.int64(0)

### Garden column

In [78]:
# Get the total number of null values in the `garden` column
results_df['garden'].isnull().sum()

np.int64(27616)

In [79]:
# Get the unique value counts
results_df['garden'].value_counts()

garden
1    10042
2      157
3        8
4        4
7        2
5        1
Name: count, dtype: int64

For the `garden` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [80]:
# Ensure that the garden column is numeric
results_df['garden'] = pd.to_numeric(
    results_df['garden'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_garden = results_df.groupby('property_type')[
    'garden'].median().reset_index()

# Rename columns for clarity
median_garden.columns = ['property_type', 'median_garden']

# Sort by median descending (optional)
median_garden = median_garden.sort_values(
    by='median_garden', ascending=False)

print(median_garden)

      property_type  median_garden
0  Apartment / Flat            1.0
1        Commercial            1.0
2              Farm            1.0
3             House            1.0
5         Townhouse            1.0
6       Vacant Land            1.0
4        Industrial           <NA>


The median number of gardens is 1 for every property type except Industrial, whose median is null. This is expected: most properties are likely to include at least one garden, whereas industrial areas typically consist of warehouses, factories, or plants with minimal landscaping. There are also outliers in the dataset, with some properties reporting up to 7 gardens.

To address this:

- We replace the outliers with the median garden count for their respective property types, using 0 as a fallback if the median is not available.
- Remaining null values in the garden column are then imputed using the same median-based strategy per property type, again defaulting to 0 where necessary.

In [81]:
# Define threshold
garden_threshold = 3

# Identify outliers
garden_outliers = results_df[results_df['garden'] > garden_threshold]

# Compute medians and replace NaNs with 0
garden_median = results_df[results_df['garden'] <= garden_threshold].groupby(
    'property_type')['garden'].median().fillna(0)

# Replace outlier values
for idx, row in garden_outliers.iterrows():
    prop_type = row['property_type']
    median_val = garden_median.get(prop_type, 0)
    results_df.at[idx, 'garden'] = median_val

In [82]:
# Fill NaN garden values with property-type median, or 0 if still missing
def fill_missing_garden(row):
    if pd.isna(row['garden']):
        return garden_median.get(row['property_type'], 0)
    return row['garden']


# Apply the function
results_df['garden'] = results_df.apply(fill_missing_garden, axis=1)

# Convert the column to integers
results_df['garden'] = results_df['garden'].round().astype('Int64')

In [83]:
# Verify that there are no null values
results_df['garden'].isnull().sum()

np.int64(0)

### Pool column

In [84]:
# Get the total number of null values in the `pool` column
results_df['pool'].isnull().sum()

np.int64(31879)

In [85]:
# Get the unique value counts
results_df['pool'].value_counts()

pool
1    5866
2      84
3       1
Name: count, dtype: int64

For the `pool` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [86]:
# Ensure that the pool column is numeric
results_df['pool'] = pd.to_numeric(
    results_df['pool'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_pool = results_df.groupby('property_type')[
    'pool'].median().reset_index()

# Rename columns for clarity
median_pool.columns = ['property_type', 'median_pool']

# Sort by median descending (optional)
median_pool = median_pool.sort_values(
    by='median_pool', ascending=False)

print(median_pool)

      property_type  median_pool
0  Apartment / Flat          1.0
1        Commercial          1.0
2              Farm          1.0
3             House          1.0
5         Townhouse          1.0
6       Vacant Land          1.0
4        Industrial         <NA>


The median value for the Industrial property type is null, which is expected given the nature of these properties. We impute missing values with zero: a pool is typically a prominent feature intended to attract customers, so if one were present it would likely have been listed explicitly.

In [87]:
# Impute the null values with zero
results_df['pool'] = results_df['pool'].fillna(0)

In [88]:
# Verify that there are no null values
results_df['pool'].isnull().sum()

np.int64(0)

### Parking column

In [89]:
# Get the total number of null values in the `parking` column
results_df['parking'].isnull().sum()

np.int64(14311)

In [90]:
# Get the unique value count
results_df['parking'].value_counts()

parking
1      10041
2       6470
4       2086
3       1567
6        929
       ...  
58         1
233        1
51         1
196        1
31         1
Name: count, Length: 75, dtype: int64

For the `parking` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [91]:
# Ensure that the parking column is numeric
results_df['parking'] = pd.to_numeric(
    results_df['parking'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_parking = results_df.groupby('property_type')[
    'parking'].median().reset_index()

# Rename columns for clarity
median_parking.columns = ['property_type', 'median_parking']

# Sort by median descending (optional)
median_parking = median_parking.sort_values(
    by='median_parking', ascending=False)

print(median_parking)

      property_type  median_parking
4        Industrial            10.0
1        Commercial             8.0
2              Farm             4.0
6       Vacant Land             2.0
3             House             2.0
0  Apartment / Flat             1.0
5         Townhouse             1.0


There are also outliers in the dataset, with the maximum number of parking spaces reaching 300. The medians support this observation: the Industrial median of 10 is very high.

To address this, we replace the outliers with the median parking value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the parking column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [92]:
# Define the threshold
parking_threshold = 5

# Identify outliers
parking_outliers = results_df[results_df['parking'] > parking_threshold]

# Compute medians and replace NaNs with 0
parking_median = results_df[results_df['parking'] <= parking_threshold].groupby(
    'property_type')['parking'].median().fillna(0)

# Replace outlier values
for idx, row in parking_outliers.iterrows():
    prop_type = row['property_type']
    median_val = parking_median.get(prop_type, 0)
    results_df.at[idx, 'parking'] = median_val

In [93]:
# Fill NaN parking values with property-type median, or 0 if still missing
def fill_missing_parking(row):
    if pd.isna(row['parking']):
        return parking_median.get(row['property_type'], 0)
    return row['parking']


# Apply the function
results_df['parking'] = results_df.apply(fill_missing_parking, axis=1)

# Convert the column to integer type
results_df['parking'] = results_df['parking'].round().astype('Int64')

In [94]:
# Verify that null values are not present anymore.
results_df['parking'].isnull().sum()

np.int64(0)

### Backup Water column

In [95]:
# Get the total number of null values in the `backup_water` column
results_df['backup_water'].isnull().sum()

np.int64(32892)

In [96]:
# Get the unique value counts
results_df['backup_water'].value_counts()

backup_water
No     3203
Yes    1735
Name: count, dtype: int64

We will calculate the mode of the `backup_water` column for each property type to determine the most common backup water availability status associated with each major property category.

In [97]:
# Calculate the mode of the backup_water column for each property type
backup_water_mode = results_df.groupby('property_type')['backup_water'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)

# Print results
print(backup_water_mode)

property_type
Apartment / Flat     No
Commercial          Yes
Farm                Yes
House                No
Industrial           No
Townhouse            No
Vacant Land          No
Name: backup_water, dtype: object


To maintain data consistency, we will fill in the null values in the `backup_water` column with *"Unknown"*.
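
A minimal sketch of this pattern on made-up data (the toy frame below is hypothetical): compute the per-type mode for inspection, then label missing entries as `"Unknown"` rather than guessing a value.

```python
import pandas as pd

toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "Farm", "Farm", "Industrial"],
    "backup_water": ["No", "No", "Yes", "Yes", None, None],
})

# Most frequent non-null value per property type (pd.NA if none exists)
toy_mode = toy.groupby("property_type")["backup_water"].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)

# Missing entries become an explicit "Unknown" category instead of a guess
toy["backup_water"] = toy["backup_water"].fillna("Unknown")

print(toy_mode)
print(toy["backup_water"].value_counts())
```

Keeping `"Unknown"` as its own category preserves the distinction between a listing that states "No" and one that says nothing about backup water.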

In [98]:
# Impute null values with "Unknown"
results_df['backup_water'] = results_df['backup_water'].fillna('Unknown')

In [99]:
# Verify no null values exist
results_df['backup_water'].isnull().sum()

np.int64(0)

### Floor Area column

In [100]:
# Get the total number of null values in the `floor_area` column
results_df['floor_area (m²)'].isnull().sum()

np.int64(19915)

In [101]:
# Get the unique value counts
results_df['floor_area (m²)'].value_counts()

floor_area (m²)
48     237
60     192
300    177
70     169
400    158
      ... 
657      1
866      1
569      1
2        1
617      1
Name: count, Length: 849, dtype: int64

The `floor_area` column contains extremely small values, starting from as low as 1 m², which we consider outliers since no property is realistically that small. To address this, we treat all values less than 50 m² as missing and replace them with the mean floor area of comparable listings, grouped by `property_type`, `suburb`, and a newly created `price_bin` column that categorizes properties into quartile-based price ranges.

Any remaining missing values are then filled with the overall median floor area across all rows.
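
The group-then-overall fill can also be expressed with `groupby(...).transform`. The sketch below uses made-up data; the column name `floor_area` and the string `price_bin` labels are simplifications for illustration, and it uses the group mean, as the code cell that follows does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "property_type": ["House", "House", "House", "Farm", "Farm"],
    "suburb": ["A", "A", "A", "B", "B"],
    "price_bin": ["low", "low", "low", "high", "high"],
    "floor_area": [100.0, 200.0, 10.0, np.nan, np.nan],
})

# Values below 50 m² are treated as missing
toy.loc[toy["floor_area"] < 50, "floor_area"] = np.nan

# Mean of each (property_type, suburb, price_bin) group, aligned to the rows
group_mean = toy.groupby(
    ["property_type", "suburb", "price_bin"]
)["floor_area"].transform("mean")

# First fill from the group mean, then from the overall median
toy["floor_area"] = toy["floor_area"].fillna(group_mean)
toy["floor_area"] = toy["floor_area"].fillna(toy["floor_area"].median())

print(toy["floor_area"].tolist())
```

The Farm rows have no valid value in their group, so they fall through to the overall median.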

In [102]:
# Convert floor area column to numeric
results_df['floor_area (m²)'] = pd.to_numeric(
    results_df['floor_area (m²)'], errors='coerce').astype('Int64')


# Create price bins to help group properties by price range
results_df['price_bin'] = pd.qcut(
    results_df['price (ZAR)'], q=4, duplicates='drop')


# Treat extremely small floor areas (<50 m²) as missing values
results_df.loc[results_df['floor_area (m²)'] < 50, 'floor_area (m²)'] = pd.NA


# Calculate grouped mean floor area for similar properties
group_means = results_df.groupby(
    ['property_type', 'suburb', 'price_bin']
)['floor_area (m²)'].mean()


# Function to replace missing floor area with grouped mean
def replace_with_group_mean(row):
    if pd.isna(row['floor_area (m²)']):
        try:
            return group_means.loc[
                (row['property_type'],
                 row['suburb'], row['price_bin'])
            ]
        except KeyError:
            return np.nan
    else:
        return row['floor_area (m²)']


# Apply the replacement
results_df['floor_area (m²)'] = results_df.apply(
    replace_with_group_mean, axis=1)

In [103]:
# Fill in remaining null values with the median across all rows
results_df['floor_area (m²)'] = results_df['floor_area (m²)'].fillna(
    results_df['floor_area (m²)'].median()
)

In [104]:
# Verify that null values do not exist
results_df['floor_area (m²)'].isnull().sum()

np.int64(0)

### Erf Size column

In [105]:
# Get the total number of null values in the `erf_size` column
results_df['erf_size (m²)'].isnull().sum()

np.int64(24065)

In [106]:
# Get the unique value counts
results_df['erf_size (m²)'].value_counts()

erf_size (m²)
495    747
496    315
990    183
400    167
600    142
      ... 
131      1
192      1
857      1
16       1
921      1
Name: count, Length: 987, dtype: int64

The `erf_size` column contains extremely small values, starting from as low as 1 m², which we consider outliers since no property is realistically that small. To address this, we treat all values less than 50 m² as missing and replace them with the mean erf size of comparable listings, grouped by `property_type`, `suburb`, and the `price_bin` column that categorizes properties into quartile-based price ranges.

Any remaining missing values are then filled with the overall median erf size across all rows.
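
The `price_bin` column is built with `pd.qcut`, which splits values into quantile-based bins of roughly equal size. A minimal sketch on made-up prices (the figures are illustrative only):

```python
import pandas as pd

toy_prices = pd.Series([500_000, 750_000, 900_000, 1_200_000,
                        1_500_000, 2_000_000, 3_500_000, 9_000_000])

# q=4 requests quartile bins; duplicates="drop" drops repeated bin
# edges instead of raising an error when many prices are identical
toy_bins = pd.qcut(toy_prices, q=4, duplicates="drop")

print(toy_bins.value_counts().sort_index())
```

Grouping on these bins compares each listing with others in a similar price range rather than with the whole market.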

In [107]:
# Convert erf size column to numeric
results_df['erf_size (m²)'] = pd.to_numeric(
    results_df['erf_size (m²)'], errors='coerce').astype('Int64')


# Create price bins to help group properties by price range
results_df['price_bin'] = pd.qcut(
    results_df['price (ZAR)'], q=4, duplicates='drop')


# Treat extremely small erf sizes (<50 m²) as missing values
results_df.loc[results_df['erf_size (m²)'] < 50, 'erf_size (m²)'] = pd.NA


# Calculate grouped mean erf size for similar properties
group_means = results_df.groupby(
    ['property_type', 'suburb', 'price_bin']
)['erf_size (m²)'].mean()


# Function to replace missing erf size with grouped mean
def replace_with_group_mean(row):
    if pd.isna(row['erf_size (m²)']):
        try:
            return group_means.loc[
                (row['property_type'],
                 row['suburb'], row['price_bin'])
            ]
        except KeyError:
            return np.nan
    else:
        return row['erf_size (m²)']


# Apply the replacement
results_df['erf_size (m²)'] = results_df.apply(
    replace_with_group_mean, axis=1)

In [108]:
# Fill in remaining null values with the median across all rows
results_df['erf_size (m²)'] = results_df['erf_size (m²)'].fillna(
    results_df['erf_size (m²)'].median()
)

In [109]:
# Verify that no null values exist
results_df['erf_size (m²)'].isnull().sum()

np.int64(0)

In [110]:
# Drop price_bin as we do not need it anymore
results_df.drop(columns=['price_bin'], inplace=True)

In [111]:
# Print a summary of the cleaned dataset
results_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37830 entries, 0 to 37870
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   listing_number   37830 non-null  object 
 1   property_type    37830 non-null  object 
 2   furnished        37830 non-null  object 
 3   pets_allowed     37830 non-null  object 
 4   backup_water     37830 non-null  object 
 5   bedrooms         37830 non-null  Int64  
 6   bathrooms        37830 non-null  Int64  
 7   kitchen          37830 non-null  Int64  
 8   lounge           37830 non-null  Int64  
 9   dining_room      37830 non-null  Int64  
 10  parking          37830 non-null  Int64  
 11  garden           37830 non-null  Int64  
 12  pool             37830 non-null  Int64  
 13  garage           37830 non-null  Int64  
 14  erf_size (m²)    37830 non-null  float64
 15  floor_area (m²)  37830 non-null  float64
 16  suburb           37830 non-null  object 
 17  city             

## 4. Save the transformed data
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Storing our pre-processed data ⚡ |
| :--------------------------- |

---

In [112]:
# Define the file path
save_path = "../data/processed/cleaned_listings_for_sale.csv"

# Save to CSV
results_df.to_csv(save_path, index=False)
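
As a quick sanity check, a saved file can be read back to confirm the round trip. The sketch below uses a made-up frame and an in-memory buffer rather than the real output path. Note that CSV stores no dtype information, so nullable `Int64` columns are re-inferred on load unless `dtype` is passed to `pd.read_csv`.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"listing_number": ["A1", "B2"],
     "bedrooms": pd.array([3, 2], dtype="Int64")},
    index=[0, 5],  # non-contiguous index, as left behind by dropped rows
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# index=False means the old index is not written as an extra column
reloaded = pd.read_csv(buffer)

print(reloaded)
print(reloaded.dtypes)
```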