# <center> 🧽 Data Cleaning Part 1 </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading Data](#2-loading-data)
- [3. Exploratory Data Analysis](#3-exploratory-data-analysis)
- [4. Save the transformed data](#4-save-the-transformed-data)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

The South African property market is as diverse as its nine provinces, which each capital city offering a distinctive combination of lifestyle, econonomic activity, and real estate dynamics. In this project, we focus on analyzing property listings across South Africa's provincial capitals to uncover trends, identify investment opportunites, and evaluate affordability across regions.

Real estate decisions, whether to buy, rent or invest rely heavily on accurate, data-driven insights. This notebook is part of a broader project aimed at equipping investors, analysts, and policy makers with a clearer view of the current landscape. By leveraging web-scraped data from online property listings, we examine key features such as property types, pricing, location, and amenities. 

Through careful data cleaning, exploratory data analysis, and visualization, we aim to:

- Compare rental and sale prices across capital cities
- Identify supply-demand dynamics
- Analyse the affordability and investment potential of different regions
- Explore how property features and economic context influence pricing

The findings from this notebook feed into a Power BI dashboard that allows for dynamic and interactive exploration of property trends across the country. 

This notebook documents each step of the analysis pipeline -starting from raw data processing to uncovering insights to ensure transparency, reproducibility, and a foundation for further exploration

## 2. Loading Data
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [1]:
# Import packages
import re
import warnings

import numpy as np
import pandas as pd

warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# Load the rental listings dataset
rental_listings = pd.read_csv("../data/raw/rental_listings_for_capital_cities.csv")

# Display the first few rows of the dataset and its shape
display("Rentals", rental_listings.head(), rental_listings.shape)

'Rentals'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...
3,,,,,,house-to-rent-in-bisho-park-101731048
4,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490


(15324, 6)

## 3. Exploratory Data Analysis
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Exploring the data ⚡ |
| :--------------------------- |

---

In [3]:
# Print a summary of the dataset
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15324 entries, 0 to 15323
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 15323 non-null  object
 1   location              8189 non-null   object
 2   property_title        15323 non-null  object
 3   property_description  15323 non-null  object
 4   property_features     15323 non-null  object
 5   listing_slug          15324 non-null  object
dtypes: object(6)
memory usage: 718.4+ KB


In [4]:
# Print the total number of null values in the dataset
rental_null_values = rental_listings.isnull().sum().sum()
print(f"There are {rental_null_values} null values in the rental dataset")

There are 7139 null values in the rental dataset


In [5]:
# Print the number of null values in each column
rental_listings.isnull().sum()

price                      1
location                7135
property_title             1
property_description       1
property_features          1
listing_slug               0
dtype: int64

In [6]:
# Print the rows that have null values in four columns or more
rental_listings[rental_listings.isnull().sum(axis=1) >= 4]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
3,,,,,,house-to-rent-in-bisho-park-101731048


In [7]:
# Drop the rows that have null values in four columns or more
rental_listings.drop(
    rental_listings[rental_listings.isnull().sum(axis=1) >= 2].index, inplace=True)

In [8]:
# Get the total number of duplicate rows in the dataset
rental_duplicates = rental_listings[rental_listings.duplicated()]
num_rental_duplicates = len(rental_duplicates)
print(f"There are {num_rental_duplicates} duplicates in the rental dataset ")

There are 1303 duplicates in the rental dataset 


In [9]:
# Drop duplicate rows and keep the first occurrence
rental_listings = rental_listings.drop_duplicates().reset_index(drop=True)

#### Property description

The `property_description` column contains free-text descriptions written by agents or property owners to highlight features and benefits of the listing. While this unstructured text can provide valuable insights, it is often subjective, inconsistent, and difficult to analyse at scale without advanced natural language processing. Since our current analysis focuses on structured, comparable features across listings, we will drop the property_description column.

In [10]:
# Drop the `property_description` column
rental_listings.drop(columns=['property_description'], inplace=True)

#### Property features extraction

The `property_features` column has a section labelled *Points of Interest* which highlights nearby amenities and infrastructure such as schools, hospitals, shopping centres, public transport stops and recreational facilities. It provides a quick view of the property's surroundings and helps to assess its suitability for lifestyle preferences, daily commuting, and community accessibility. 

We remove this section to reduce noise and ensure consistency, as it contains unstructured text, and it may not be directly useful for aggregation without additional geospatial processing. 

In [11]:
# Remove any information following the text 'Points of Interest' in the 'property_features' column
rental_listings["property_features"] = rental_listings["property_features"].str.replace(
    r'Points of Interest.*', '', regex=True
)

# Remove any empty spaces for easier analysis
rental_listings["property_features"] = rental_listings["property_features"].str.strip()

The `property_features` column has a wealth of information about each listing, ranging from the listing number, the type of property, room information, deposit information, occupation date, etc. However, for the purpose of this analysis, we will only extract a subset of relevant features that are most structured and analytically useful where we can reliably quantify and compare across properties. Once the necessary information has been extracted, we will drop the property_features column.

In [12]:
# We define the patterns of the features we want to extract
patterns = {
    'listing_number': r'Listing Number(\d+)',
    'property_type': r'Type of Property\s*(House|Apartment / Flat|Townhouse|Commercial|Industrial|Vacant Land|Farm)',
    'furnished': r'Furnished(Yes|No)',
    'pets_allowed': r'Pets Allowed(Yes|No)',
    'backup_water': r'Backup Water(Yes|No)',
    'bedrooms': r'Bedrooms?\s*(\d+)',
    'bathrooms': r'Bathrooms?\s*(\d+)',
    'kitchen': r'Kitchens?\s*(\d+)',
    'lounge': r'Lounges?\s*(\d+)',
    'dining_room': r'Dining Rooms?\s*(\d+)',
    'parking': r'Parking(\d+)',
    'garden': r'Gardens?\s*(\d+)',
    'pool': r'Pools?\s*(\d+)',
    'garage': r'Garages?\s*(\d+)',
    'erf_size (m²)': r'Erf Size\s*(\d+)\s*m²',
    'floor_area (m²)': r'Floor Area\s*(\d+)\s*m²'
}



# Function to extract features from text
def extract_features(text):
    return {
        field: (match := re.search(pattern, text)
                ) and match.group(1).strip() or None for field, pattern in patterns.items()
    }
    
# Apply extraction row-wise
extracted_df = rental_listings['property_features'].apply(extract_features).apply(pd.Series)

# Combine the original df with the extracted columns
results_df = pd.concat([rental_listings, extracted_df], axis=1)

# Drop the property_features after the extraction is complete
results_df.drop(columns=['property_features'], inplace=True)

# View results
results_df.head()

Unnamed: 0,price,location,property_title,listing_slug,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,dining_room,parking,garden,pool,garage,erf_size (m²),floor_area (m²)
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,115295115,House,No,No,No,3.0,1.0,1.0,1.0,1.0,1.0,,,,,
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,3-bedroom-house-to-rent-in-bhisho-central-1153...,115332086,House,No,No,,3.0,1.0,1.0,1.0,,1.0,1.0,,,,
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,115210602,House,No,Yes,,2.0,1.0,1.0,1.0,,1.0,1.0,,,,
3,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,3-bedroom-house-to-rent-in-bisho-park-115555490,115555490,House,No,Yes,,,1.0,1.0,,,3.0,,,,,
4,R 5 500Per Month,"Bhisho, Bisho Park",House to Rent in Bisho Park,house-to-rent-in-bisho-park-115525482,115525482,House,,,,,,,,,,,,,,


In [13]:
# Print the number of null values in the `bedrooms` column
results_df["bedrooms"].isnull().sum()

np.int64(11068)

The `property_title` column often contains structured phrases that include the number of bedrooms for residential properties such as apartments, houses and townhouses. This consistency allows for the extraction of bedroom information directly from the title using regular expressions, which can help fill in missing values in the `bedrooms` columns. 

In [14]:
def extract_bedrooms_from_property_title(title):
    if pd.isna(title):
        return None
    match = re.search(r'(\d+)[-\s]?bedroom', title.lower())
    return int(match.group(1)) if match else None

results_df['bedrooms'] = results_df['bedrooms'].fillna(
    results_df.apply(lambda row: extract_bedrooms_from_property_title(row['property_title']) if pd.isna(row['bedrooms']) else row['bedrooms'], axis=1))

# Convert the column to an integer type
results_df['bedrooms'] = pd.to_numeric(results_df['bedrooms'], errors='coerce').astype('Int64')

In [15]:
# Print the number of null values in the `bedrooms` column after the function has been applied
results_df["bedrooms"].isnull().sum()

np.int64(8521)

####  Suburb, city and province extraction

The `property_title` column follows a structure that includes the suburb where the property is located. We can use this to extract the suburb for each listing. 

In [16]:
# Extract the suburb from the property_title column
results_df['suburb'] = results_df['property_title'].str.extract(
    r'to Rent in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
results_df['suburb'] = (
    results_df['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Drop the property_title column as we have extracted all the information we need
results_df.drop(columns=['property_title'], inplace=True)

# Print the first five rows of the table to see the results
results_df.head()

Unnamed: 0,price,location,listing_slug,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,dining_room,parking,garden,pool,garage,erf_size (m²),floor_area (m²),suburb
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,115295115,House,No,No,No,3.0,1.0,1.0,1.0,1.0,1.0,,,,,,Tyu Tyu North
1,R 7 000Per Month,"Bhisho, Bhisho Central",3-bedroom-house-to-rent-in-bhisho-central-1153...,115332086,House,No,No,,3.0,1.0,1.0,1.0,,1.0,1.0,,,,,Bhisho Central
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,115210602,House,No,Yes,,2.0,1.0,1.0,1.0,,1.0,1.0,,,,,Tyu Tyu North
3,R 7 500Per Month,"Bhisho, Bisho Park",3-bedroom-house-to-rent-in-bisho-park-115555490,115555490,House,No,Yes,,3.0,1.0,1.0,,,3.0,,,,,,Bisho Park
4,R 5 500Per Month,"Bhisho, Bisho Park",house-to-rent-in-bisho-park-115525482,115525482,House,,,,,,,,,,,,,,,Bisho Park


We will then load a seperate **city_province** csv file that contains a mapping of cities to their matching provinces to validate and fill in missing province information in the main dataset. 

In [17]:
# Load the CSV that contains the capital_cities with their respective provinces and create a mapping
city_province = pd.read_csv("../data/raw/south_africa_provinces_capitals.csv")

# Crete a mapping from city to province
city_list = city_province['city'].dropna().str.strip().unique()

The `location` column contains inconsistent formats, where it lists both the suburb and city and other times provides a full address. To address this, we use the *city_list* mapping to extract the city by matching it against known entries. If no match is found, the value is replaced with **None**.

In [18]:
def extract_city_from_location(location, city_list):
    if pd.isna(location):
        return None
    
    
    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None

In [19]:
# Apply the extract_city_from_location function to extract the city from the location column
results_df['city'] = results_df['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

# Drop the location column as it is redundant
results_df.drop(columns=["location"], inplace=True)

The `city` column contains null values because some entries in the `location` column are empty or incomplete. To address this, we create a mapping of known suburbs to their matching cities and use it to fill in the missing `city` column. 

In [20]:
# Create a mapping of suburb to known cities
suburb_to_city_map = results_df[results_df['city'].notna()].drop_duplicates(subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
results_df['city'] = results_df.apply(lambda row: suburb_to_city_map.get(row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis= 1)   

In [21]:
# Print out the number of null values within the `city` column
results_df['city'].isnull().sum()

np.int64(91)

There are 91 null rows in the city column which account for less than 0.5% of the total dataset. We remove these rows to maintain data quality given their small proportion. 

In [22]:
# Drop the city rows with null values
results_df = results_df.dropna(subset=['city'])
results_df['city'].isnull().sum()

np.int64(0)

We will then merge the dataset with the **city_province.csv** file to assign the matching province to each city. 

In [23]:
# Merge the city with its matching province
results_df = results_df.merge(city_province, on='city', how='left')

The `listing_slug` column contains a URL-friendly string summarising key details about this listing, such as the number of bedrooms, property type, suburb, and a unique listing ID. This information is already available in other columns; we drop `listing_slug` to avoid redundancy. 

In [24]:
# Drop the listing slug column
results_df.drop(columns=['listing_slug'], inplace=True)

#### Price column

The `price` column currently combines the price amount with its associated pricing type (e.g., R 7000Per Month), which complicates aggregation and analysis. To resolve this, we separate the pricing type into a new `price_type` column and extract the numeric value into a new `price (ZAR)` column, removing the currency symbol in the process. The cleaned data is then stored in a new DataFrame called **results_df**. 

In [25]:
# Extract the pricing option after the price amount
results_df['price_type'] = results_df['price'].str.extract(
    r'(Per\s*Month|Per\s*Day|m2|m²|Perm²)', flags=re.IGNORECASE, expand=False
)


# Normalise price_type text (remove spaces, lower case), then map to standard form
results_df['price_type'] = results_df['price_type'].str.replace(
    r'\s+', '', regex=True).str.lower()
results_df['price_type'] = results_df['price_type'].replace({
    'permonth': 'Per Month',
    'perday': 'Per Day',
    'm2': 'Per m²',
    'm²': 'Per m²',
    'perm²': 'Per m²'
})

# Remove spaces inside the price string first
results_df['price_clean'] = results_df['price'].str.replace(
    r'\s+', '', regex=True)

# Now extract the numeric part (digits only, continuous)
results_df['price (ZAR)'] = results_df['price_clean'].str.extract(
    r'R?(\d+)', expand=False)

# Convert the price(ZAR) column to numeric
results_df['price (ZAR)'] = pd.to_numeric(
    results_df['price (ZAR)'], errors='coerce')

# Drop the helper column `price_clean` column
results_df.drop(columns=['price_clean'], inplace=True)

# Then clean blank or whitespace-only price_type entries as before
results_df['price_type'] = results_df['price_type'].replace(
    r'^\s*$', pd.NA, regex=True)

# Preview results
print(results_df[['price', 'price_type', 'price (ZAR)']].head())

              price price_type  price (ZAR)
0  R 7 000Per Month  Per Month       7000.0
1  R 7 000Per Month  Per Month       7000.0
2  R 5 300Per Month  Per Month       5300.0
3  R 7 500Per Month  Per Month       7500.0
4  R 5 500Per Month  Per Month       5500.0


In [26]:
# Get the unique values in the price_type column
results_df['price_type'].value_counts()

price_type
Per Month    12826
Per m²         948
Per Day        143
Name: count, dtype: int64

In [27]:
# Get the number of null values in the price_type column
results_df['price_type'].isnull().sum()

np.int64(12)

There are 12 null values in the `price_type` column, which is unexpected given that the `price` column contains no null entries. To investigate this, we filter the dataset to display the rows where `price_type` is missing. 

In [28]:
# Visualize the rows where price_type is null
results_df[results_df['price_type'].isna()]

Unnamed: 0,price,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,...,garden,pool,garage,erf_size (m²),floor_area (m²),suburb,city,province,price_type,price (ZAR)
357,POA,114774738,Industrial,No,,,,,,,...,,,,,,Hilton,Bloemfontein,Free State,,
3728,POA,115605901,House,Yes,No,,4.0,6.0,2.0,2.0,...,1.0,1.0,1.0,,,Bantry Bay,Cape Town,Western Cape,,
3760,POA,115594664,House,Yes,No,,5.0,7.0,2.0,4.0,...,,1.0,1.0,,,Clifton,Cape Town,Western Cape,,
3771,POA,115613951,House,Yes,,,5.0,5.0,2.0,4.0,...,,1.0,1.0,,,Fresnaye,Cape Town,Western Cape,,
3837,POA,115606187,House,Yes,No,,4.0,5.0,1.0,1.0,...,,1.0,2.0,,,Clifton,Cape Town,Western Cape,,
4180,POA,108778518,House,Yes,,,5.0,5.0,1.0,1.0,...,,,,991.0,862.0,Llandudno,Cape Town,Western Cape,,
4587,POA,115118688,House,Yes,No,,5.0,4.0,1.0,,...,,,,530.0,,Camps Bay,Cape Town,Western Cape,,
4610,POA,115153896,House,Yes,No,,5.0,5.0,,,...,,,,,463.0,Clifton,Cape Town,Western Cape,,
4896,POA,110813037,House,Yes,,,8.0,8.0,1.0,1.0,...,,,,,605.0,Llandudno,Cape Town,Western Cape,,
4899,POA,113953599,House,Yes,,,7.0,7.0,,,...,,,,,,Constantia,Cape Town,Western Cape,,


These 12 rows are interesting because the `price` column has the acronym 'POA', which stands for *Price on Application* but there is no price_type such as monthly, per square meter or per day. 

These 12 rows are interesting because the `price` column contains the acronym "POA" (Price on Application), yet the corresponding `price_type` is missing; e.g., values like "Per Month", "Per Square Metre" or "Per Day" are not provided. 

We will then calculate the total number of rows where the `price` column contains "POA", including those with an associated `price_type` and those without.

In [29]:
# Get the number of rows in the dataset in the 'price' column that are equal to 'POA'
poa_mask = results_df['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

# Print the total number of rows
print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 217 listings with 'POA' in the price column.


There are currently 217 rows containing "POA". To clean this, we will calculate the average price based on `suburb`, `property_type` and `price_type`, and use these averages to impute the missing values. Rows with "POA" without a `price_type` will then be removed. 

In [30]:
# Identify POA rows with valid price_type
poa_mask = results_df['price'].str.lower().str.startswith('poa')
poa_with_type_mask = poa_mask & results_df['price_type'].notna() & (
    results_df['price_type'] != '')

# Calculate averages for the suburb, property_type and price_type
avg_prices = results_df[~poa_mask & results_df['price (ZAR)'].notna()] \
    .groupby(['suburb', 'property_type', 'price_type'])['price (ZAR)'] \
    .apply(lambda x: pd.to_numeric(x, errors='coerce').mean()) \
    .reset_index().rename(columns={'price (ZAR)': 'price_avg'})

# Merge the averages into the original DataFrame
results_df = results_df.merge(
    avg_prices, on=['suburb', 'property_type', 'price_type'], how='left')

# Replace the POA with non-empty price_types with the averages
results_df.loc[poa_with_type_mask,
               'price (ZAR)'] = results_df.loc[poa_with_type_mask, 'price_avg']

# Drop the POA rows without a price type
results_df = results_df[~(poa_mask & ~poa_with_type_mask)]

# Drop the helper column
results_df.drop(columns=['price_avg'], inplace=True)

In [31]:
print("Remaining 'POA':", results_df['price'].str.lower().eq('poa').sum())
print("Remaining nulls 'price (ZAR)':",
      results_df['price (ZAR)'].isna().sum())

Remaining 'POA': 0
Remaining nulls 'price (ZAR)': 12


There are unexpected null values in the `price (ZAR)` column. To investigate, we filter the dataset to display only the rows where the price (in rands) is null

In [32]:
# Get the rows where price (ZAR) is null
results_df[results_df['price (ZAR)'].isna()]

Unnamed: 0,price,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,...,garden,pool,garage,erf_size (m²),floor_area (m²),suburb,city,province,price_type,price (ZAR)
4246,POAPer Day,115374518,Apartment / Flat,,,,4.0,4.0,1.0,,...,,1.0,,,,Fresnaye,Cape Town,Western Cape,Per Day,
4586,POAPer Day,112682529,House,Yes,,,7.0,7.0,,,...,,,,,,Zwaanswyk,Cape Town,Western Cape,Per Day,
4836,POAPer Day,111244618,Apartment / Flat,,,,3.0,3.0,,,...,,1.0,,,286.0,Bakoven,Cape Town,Western Cape,Per Day,
4863,POAPer Day,114968205,Apartment / Flat,,,,4.0,4.0,1.0,,...,,1.0,,,,Fresnaye,Cape Town,Western Cape,Per Day,
4991,POAPer Day,111859212,Apartment / Flat,Yes,,,3.0,4.0,1.0,,...,,1.0,,,,Bantry Bay,Cape Town,Western Cape,Per Day,
7450,POAPer m²,115523149,Commercial,,,,,,,,...,,,,,235.0,Heriotdale,Johannesburg,Gauteng,Per m²,
8317,POAPer m²,115418255,Commercial,,,,,,,,...,,,,,260.0,Heriotdale,Johannesburg,Gauteng,Per m²,
8706,POAPer Month,114459245,Industrial,No,,,,,,,...,,,,500.0,500.0,Westgate,Johannesburg,Gauteng,Per Month,
8758,POAPer m²,115517306,Commercial,,,,,,,,...,,,,,225.0,Heriotdale,Johannesburg,Gauteng,Per m²,
12308,POAPer Month,113997397,Industrial,No,,,,,,,...,,,,,,Nelsville,Nelspruit,Mpumalanga,Per Month,


These 12 rows remain because an average based on the combination of `property_type`, `price_type` and `suburb` could not be calculated for them. Since they represent a small subset of the data, we will remove them from the dataset. 

In [33]:
# Drop rows with the price (ZAR) column that are null
results_df = results_df.dropna(subset=['price (ZAR)'])

In [34]:
# Drop the price column as we do not need it anymore
results_df.drop(columns=['price'], inplace=True)

We will now proceed to clean the columns that were previously extracted from the `property_features` column. 

### Pets Allowed column

In [35]:
# Get the total number of null values in the `pets_allowed` column
results_df['pets_allowed'].isnull().sum()

np.int64(10304)

In [36]:
# Get the total number of unique values in the `pets_allowed` column
results_df['pets_allowed'].value_counts()

pets_allowed
No     2598
Yes    1003
Name: count, dtype: int64

We will calculate the mode of the `pets_allowed` column for each property type to identify the most common pet policy associated with each major property category. 

In [37]:
# Calculate mode for each property type
pets_mode = results_df.groupby('property_type')['pets_allowed'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)

# Print results
print(pets_mode)

property_type
Apartment / Flat     No
Commercial           No
Farm                Yes
House               Yes
Industrial           No
Townhouse            No
Vacant Land          No
Name: pets_allowed, dtype: object


We will then use the mode values for each property type to fill in the missing values in the `pets_allowed` column. 

In [38]:
# Initialise a dictionary that contains the common mode values for each property type
default_pets_value = {
    'Apartment / Flat': 'No', 
    'Commercial': 'No',
    'Farm': 'Yes',
    'House': 'Yes', 
    'Industrial': 'No', 
    'Townhouse': 'No', 
    'Vacant Land': 'No'
}

# Apply the default_pet_value for all null values
results_df['pets_allowed'] = results_df.apply(
    lambda row: default_pets_value[row['property_type']]
    if pd.isna(row['pets_allowed']) and row['property_type'] in default_pets_value
    else row['pets_allowed'],
    axis=1
)

In [39]:
# Verify that no null values exist in the `pets_allowed` column
print(results_df['pets_allowed'].isna().sum())

0


### Furnished column

In [40]:
# Get the number of null values in the `furnished` column
results_df['furnished'].isnull().sum()

np.int64(4635)

In [41]:
# Get the total number of unique value counts in the `furnished` column
results_df['furnished'].value_counts()

furnished
No     7962
Yes    1308
Name: count, dtype: int64

We will calculate the mode of the `furnished` column for each property type to identify the most common furnishing status associated with each major property category.

In [42]:
# Calculate mode for each property type for the furnished column
furnished_mode = results_df.groupby('property_type')['furnished'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)
print(furnished_mode)

property_type
Apartment / Flat    No
Commercial          No
Farm                No
House               No
Industrial          No
Townhouse           No
Vacant Land         No
Name: furnished, dtype: object


To maintain data consistency, we will fill in the null values in the `furnished` column with *"Unknown"*.

In [43]:
# Fill the null values in the furnished column with "Unknown"
results_df['furnished'] = results_df['furnished'].fillna("Unknown")

In [44]:
# Verify that there are no null values in the `furnished` column
print(results_df['furnished'].isna().sum())

0


### Bedrooms column

In [45]:
# Get the total number of null values in the `bedrooms` column
results_df['bedrooms'].isnull().sum()

np.int64(8449)

In [46]:
# Get the number of unique value counts in the `bedrooms` column
results_df['bedrooms'].value_counts()

bedrooms
2     1825
1     1446
3     1178
5      403
4      347
0      177
6       45
7       17
8        8
9        4
11       2
13       2
12       1
10       1
Name: count, dtype: Int64

For the `bedrooms` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [47]:
# Calculate the median number of bedrooms for each property_type
bedroom_medians = results_df.groupby('property_type')['bedrooms'].median()

def fill_bedrooms(row):
    if pd.isna(row['bedrooms']):
        median = bedroom_medians.get(row['property_type'], np.nan)
        return median if pd.notna(median) else 0
    return row['bedrooms']

# Apply function
results_df['bedrooms'] = results_df.apply(fill_bedrooms, axis=1)

# Convert results to integer
results_df['bedrooms'] = results_df['bedrooms'].round().astype('Int64')

In [48]:
# Verify there are no null values in the `bedrooms` column
results_df['bedrooms'].isnull().sum()

np.int64(0)

### Bathrooms column

In [49]:
# Get the total number of null values in the `bathrooms` column
results_df['bathrooms'].isnull().sum()

np.int64(6113)

In [50]:
# Get the total number of unique value counts in the `bathrooms` column
results_df['bathrooms'].value_counts()

bathrooms
1      3937
2      2662
3       529
4       347
6        99
5        90
8        41
7        26
10       21
0        13
12        6
15        5
9         4
20        3
14        2
16        2
11        1
13        1
30        1
200       1
160       1
Name: count, dtype: int64

There appear to be outliers in the `bathrooms` column, with a maximum value of 200. To further investigate, we will identify and visualise the number of outlier entries per property type, using 9 bathrooms as the threshold. 

In [51]:
# Ensure 'bathrooms' is numeric
results_df['bathrooms'] = pd.to_numeric(results_df['bathrooms'], errors='coerce').astype('Int64')

# Filter listings with bathrooms > 9
bathroom_outliers = results_df[results_df['bathrooms'] > 9]

# Group by property_type and count
bathroom_summary = bathroom_outliers.groupby(
    'property_type')['bathrooms'].count().reset_index()

# Rename column for clarity
bathroom_summary.columns = ['property_type', 'Count with >10 bathrooms']

# Sort by count descending
bathroom_summary = bathroom_summary.sort_values(
    by='Count with >10 bathrooms', ascending=False)

print(bathroom_summary)

  property_type  Count with >10 bathrooms
0    Commercial                        21
2    Industrial                        19
1         House                         4


In total, there are 44 outlier entries in the `bathrooms` column, with the highest count found in the commercial property type. To address those, we replace the outlier values with the median number of bathrooms for each property type. If a median value cannot be determined (i.e., is missing), we substitute it with zero. 

For the remaining null values, we impute them using the median number of bathrooms for each property type.

In [52]:
# Set the threshold for outlier bathroom numbers
threshold = 9 

# Define the outlier bathrooms as specified by the threshold
outliner_bathrooms = results_df[results_df['bathrooms'] > threshold]

bathroom_median = results_df[results_df['bathrooms'] <= threshold].groupby('property_type')['bathrooms'].median()

bathroom_median = bathroom_median.fillna(0)

for idx, row in outliner_bathrooms.iterrows():
    prop_type = row['property_type']
    median_val = bathroom_median.get(prop_type, 0)
    results_df.at[idx, 'bathrooms'] = median_val

In [53]:
# Fill NaN bathroom values with property-type median, or 0 if still missing
def fill_missing_bathrooms(row):
    if pd.isna(row['bathrooms']):
        return bathroom_median.get(row['property_type'], 0)
    return row['bathrooms']

# Apply the above function to the `bathrooms` column
results_df['bathrooms'] = results_df.apply(fill_missing_bathrooms, axis=1)

# Convert results to integers
results_df['bathrooms'] = results_df['bathrooms'].round().astype('Int64')

In [54]:
# Verify there are no null values in the `bathrooms` column
results_df['bathrooms'].isnull().sum()

np.int64(0)

### Kitchen column

In [55]:
# Get the total number of null values in the `kitchen` column
results_df['kitchen'].isnull().sum()

np.int64(7387)

In [56]:
# Get the unique value counts for the `kitchen` column
results_df['kitchen'].value_counts()

kitchen
1     6116
2      304
4       44
3       40
5       10
6        2
21       1
10       1
Name: count, dtype: int64

For the `kitchen` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [57]:
# Ensure that the kitchen column is numeric
results_df['kitchen'] = pd.to_numeric(
    results_df['kitchen'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_kitchens = results_df.groupby('property_type')[
    'kitchen'].median().reset_index()

# Rename columns for clarity
median_kitchens.columns = ['property_type', 'Median kitchen']

# Sort by median descending (optional)
median_kitchens = median_kitchens.sort_values(
    by='Median kitchen', ascending=False)

print(median_kitchens)

      property_type  Median kitchen
0  Apartment / Flat             1.0
1        Commercial             1.0
2              Farm             1.0
3             House             1.0
4        Industrial             1.0
5         Townhouse             1.0
6       Vacant Land            <NA>


The vacant land property type has a median kitchen value of null, which is expected given its nature. There are also outliers in the dataset, with the maximum number of kitchens reaching 10. To address this, we replace the outliers with the median kitchen value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the Kitchen column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [58]:
kitchen_threshold = 6

# Identify outliers
kitchen_outliers = results_df[results_df['kitchen'] > kitchen_threshold]

# Compute medians and replace NaNs with 0
kitchen_median = results_df[results_df['kitchen'] <= kitchen_threshold] \
    .groupby('property_type')['kitchen'].median().fillna(0)

# Replace outlier values
for idx, row in kitchen_outliers.iterrows():
    prop_type = row['property_type']
    median_val = kitchen_median.get(prop_type, 1)
    results_df.at[idx, 'kitchen'] = median_val

In [59]:
# Fill NaN kitchen values with property-type median, or 0 if still missing
def fill_missing_kitchen(row):
    if pd.isna(row['kitchen']):
        return kitchen_median.get(row['property_type'], 0)
    return row['kitchen']

# Apply the function
results_df['kitchen'] = results_df.apply(fill_missing_kitchen, axis=1)

# Convert results to integers
results_df['kitchen'] = results_df['kitchen'].round().astype('Int64')

In [60]:
# Verify that there are no null values
results_df['kitchen'].isnull().sum()

np.int64(0)

### Lounge column

In [61]:
# Get the total number of null values in the `lounge` column
results_df['lounge'].isnull().sum()

np.int64(12299)

In [62]:
# Get the unique value counts of the `Lounge` column
results_df['lounge'].value_counts()

lounge
1    1575
2      29
3       2
Name: count, dtype: int64

For the `lounge` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [63]:
# Ensure that the lounge column is numeric
results_df['lounge'] = pd.to_numeric(
    results_df['lounge'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_lounge = results_df.groupby('property_type')[
    'lounge'].median().reset_index()

# Rename columns for clarity
median_lounge.columns = ['property_type', 'Median lounge']

# Sort by median descending (optional)
median_lounge = median_lounge.sort_values(
    by='Median lounge', ascending=False)

print(median_lounge)

      property_type  Median lounge
0  Apartment / Flat            1.0
1        Commercial            1.0
2              Farm            1.0
3             House            1.0
5         Townhouse            1.0
4        Industrial           <NA>
6       Vacant Land           <NA>


The vacant land property type has a median lounge value of null, which is expected given the nature of the property. Since there are no apparent outliers in the dataset, we focus on addressing the remaining null values. These are imputed using the median lounge value for each property type, defaulting to 0 where the median is unavailable.

In [64]:
# Compute median lounges per property type
lounge_median = results_df.groupby('property_type')['lounge'].median()

# Fill NaNs in median with 0
median_lounge = lounge_median.fillna(0)

# Function using the correctly filled median
def fill_lounge(row):
    if pd.isna(row['lounge']):
        return median_lounge.get(row['property_type'], 0)
    return row['lounge']


# Apply the function
results_df['lounge'] = results_df.apply(fill_lounge, axis=1)

# Convert results to integers
results_df['lounge'] = results_df['lounge'].round().astype('Int64')

In [65]:
# Verify that there are no null values present
results_df['lounge'].isnull().sum()

np.int64(0)

### Dining Room column

In [66]:
# Get the total number of null values in the `dining_room` column
results_df['dining_room'].isnull().sum()

np.int64(13103)

In [67]:
# Get the unique value counts in the `Dining Room` column
results_df['dining_room'].value_counts()

dining_room
1    795
2      6
3      1
Name: count, dtype: int64

For the `dining_room` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [68]:
# Ensure that the dining_room column is numeric
results_df['dining_room'] = pd.to_numeric(
    results_df['dining_room'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_dining_room = results_df.groupby('property_type')[
    'dining_room'].median().reset_index()

# Rename columns for clarity
median_dining_room.columns = ['property_type', 'median_dining_room']

# Sort by median descending (optional)
median_dining_room = median_dining_room.sort_values(
    by='median_dining_room', ascending=False)

print(median_dining_room)

      property_type  median_dining_room
0  Apartment / Flat                 1.0
1        Commercial                 1.0
3             House                 1.0
5         Townhouse                 1.0
2              Farm                <NA>
4        Industrial                <NA>
6       Vacant Land                <NA>


The vacant land and industrial property types have a median dining room value of null, which is expected given the nature of the property. Since there are no apparent outliers in the dataset, we focus on addressing the remaining null values. These are imputed using each property type's median dining room value, defaulting to 0 where the median is unavailable.

In [69]:
# Compute median lounges per property type
dining_room_median = results_df.groupby('property_type')['dining_room'].median()

# Fill NaNs in median with 0
median_dining_room = dining_room_median.fillna(0)

# Function using the correctly filled median
def fill_lounge(row):
    if pd.isna(row['dining_room']):
        return median_dining_room.get(row['property_type'], 0)
    return row['dining_room']


# Apply the function to impute the null values
results_df['dining_room'] = results_df.apply(fill_lounge, axis=1)

# Convert the column to integer
results_df['dining_room'] = results_df['dining_room'].round().astype('Int64')

In [70]:
# Verify that no null values exist in the column
results_df['dining_room'].isnull().sum()

np.int64(0)

### Garage column

In [71]:
# Get the total number of null values in the `garage` dataset
results_df['garage'].isnull().sum()

np.int64(13242)

In [72]:
# Get the unique value counts for the `garage` column
results_df['garage'].value_counts()

garage
1     388
2     233
3      18
4      12
5       5
40      2
14      1
20      1
10      1
6       1
9       1
Name: count, dtype: int64

For the `garage` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [73]:
# Ensure that the garage column is numeric
results_df['garage'] = pd.to_numeric(
    results_df['garage'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_garage = results_df.groupby('property_type')[
    'garage'].median().reset_index()

# Rename columns for clarity
median_garage.columns = ['property_type', 'median_garage']

# Sort by median descending (optional)
median_garage = median_garage.sort_values(
    by='median_garage', ascending=False)

print(median_garage)

      property_type  median_garage
4        Industrial           30.0
1        Commercial           12.0
3             House            2.0
0  Apartment / Flat            1.0
5         Townhouse            1.0
2              Farm           <NA>
6       Vacant Land           <NA>


The vacant land and farm property types have a median garage value of null, which is expected given their nature. There are also outliers in the dataset, with the maximum number of garages reaching 40. The median values also support this observation, with the median value of the industrial type being 30, which is very high. 

To address this, we replace the outliers with the median garage value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the garage column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [74]:
# Define the threshold
garage_threshold = 5

# Identify outliers
garage_outliers = results_df[results_df['garage'] > garage_threshold]

# Compute medians and replace NaNs with 0
garage_median = results_df[results_df['garage'] <= garage_threshold].groupby('property_type')['garage'].median().fillna(0)

# Replace outlier values
for idx, row in garage_outliers.iterrows():
    prop_type = row['property_type']
    median_val = garage_median.get(prop_type, 1)
    results_df.at[idx, 'garage'] = median_val

In [75]:
# Fill NaN kitchen values with property-type median, or 0 if still missing
def fill_missing_garage(row):
    if pd.isna(row['garage']):
        return garage_median.get(row['property_type'], 0)
    return row['garage']

# Apply the function
results_df['garage'] = results_df.apply(fill_missing_garage, axis=1)

# Convert the column to integer
results_df['garage'] = results_df['garage'].round().astype('Int64')

In [76]:
# Verify that there are no null values
results_df['garage'].isnull().sum()

np.int64(0)

### Garden column

In [77]:
# Get the total number of null values in the `garden` column
results_df['garden'].isnull().sum()

np.int64(12622)

In [78]:
# Get the unique value counts
results_df['garden'].value_counts()

garden
1     1274
2        7
23       1
3        1
Name: count, dtype: int64

For the `garden` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [79]:
# Ensure that the garden column is numeric
results_df['garden'] = pd.to_numeric(
    results_df['garden'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_garden = results_df.groupby('property_type')[
    'garden'].median().reset_index()

# Rename columns for clarity
median_garden.columns = ['property_type', 'median_garden']

# Sort by median descending (optional)
median_garden = median_garden.sort_values(
    by='median_garden', ascending=False)

print(median_garden)

      property_type  median_garden
0  Apartment / Flat            1.0
1        Commercial            1.0
2              Farm            1.0
3             House            1.0
4        Industrial            1.0
5         Townhouse            1.0
6       Vacant Land            1.0


The median values for all property types are the same, which is unsurprising, as most properties are expected to have at least one garden. There are also outliers in the dataset, with the maximum number of gardens reaching 23. To address this, we replace the outliers with the median kitchen value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the garden column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [80]:
garden_threshold = 3

# Identify outliers
garden_outliers = results_df[results_df['garden'] > garden_threshold]

# Compute medians and replace NaNs with 0
garden_median = results_df[results_df['garden'] <= garden_threshold].groupby(
    'property_type')['garden'].median().fillna(0)

# Replace outlier values
for idx, row in garden_outliers.iterrows():
    prop_type = row['property_type']
    median_val = garden_median.get(prop_type, 1)
    results_df.at[idx, 'garden'] = median_val

In [81]:
# Fill NaN kitchen values with property-type median, or 0 if still missing
def fill_missing_garden(row):
    if pd.isna(row['garden']):
        return garage_median.get(row['property_type'], 0)
    return row['garden']

# Apply the function
results_df['garden'] = results_df.apply(fill_missing_garden, axis=1)

# Convert the column to integers
results_df['garden'] = results_df['garden'].round().astype('Int64')

In [82]:
# Verify that there are no null values
results_df['garden'].isnull().sum()

np.int64(0)

### Pool column

In [83]:
# Get the total number of null values in the `pool` column
results_df['pool'].isnull().sum()

np.int64(13175)

In [84]:
# Get the unique value counts
results_df['pool'].value_counts()

pool
1    719
2     10
3      1
Name: count, dtype: int64

For the `pool` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [85]:
# Ensure that the pool column is numeric
results_df['pool'] = pd.to_numeric(
    results_df['pool'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_pool = results_df.groupby('property_type')[
    'pool'].median().reset_index()

# Rename columns for clarity
median_pool.columns = ['property_type', 'median_pool']

# Sort by median descending (optional)
median_pool = median_pool.sort_values(
    by='median_pool', ascending=False)

print(median_pool)

      property_type  median_pool
0  Apartment / Flat          1.0
1        Commercial          1.0
3             House          1.0
5         Townhouse          1.0
2              Farm         <NA>
4        Industrial         <NA>
6       Vacant Land         <NA>


The median values for the farm, industrial, and vacant land property types are null, which is expected given the nature of these properties. For the missing values, we will impute zeros, as a pool is typically a prominent feature intended to attract customers—if it were present, it would likely have been explicitly listed.

In [86]:
# Impute the null values with zero
results_df['pool'] = results_df['pool'].fillna(0)

In [87]:
# Verify that there are no null values
results_df['pool'].isnull().sum()

np.int64(0)

### Parking column

In [88]:
# Get the total number of null values in the `parking` column
results_df['parking'].isnull().sum()

np.int64(8477)

In [89]:
# Get the unique value count
results_df['parking'].value_counts()

parking
1      2361
2      1122
4       385
3       242
5       169
       ... 
65        1
137       1
175       1
160       1
66        1
Name: count, Length: 94, dtype: int64

For the `parking` column, we calculate the median values for each property type to understand the central tendency within the dataset.

In [90]:
# Ensure that the parking column is numeric
results_df['parking'] = pd.to_numeric(
    results_df['parking'], errors='coerce').astype('Int64')


# Group by property type and calculate median
median_parking = results_df.groupby('property_type')[
    'parking'].median().reset_index()

# Rename columns for clarity
median_parking.columns = ['property_type', 'median_parking']

# Sort by median descending (optional)
median_parking = median_parking.sort_values(
    by='median_parking', ascending=False)

print(median_parking)

      property_type  median_parking
1        Commercial             6.0
4        Industrial             5.0
2              Farm             2.0
3             House             2.0
0  Apartment / Flat             1.0
5         Townhouse             1.0
6       Vacant Land             1.0


There are also outliers in the dataset, with the maximum number of parking reaching 3050. The median values also support this observation, with the median value of the commercial type being 6, which is very high. 

To address this, we replace the outliers with the median parking value for each property type, defaulting to zero if the median is unavailable. 

For the remaining null values, we impute the parking column using the median value for the respective property type, defaulting to 0 again if the median is unavailable.

In [91]:
# Define the threshold
parking_threshold = 5

# Identify outliers
parking_outliers = results_df[results_df['parking'] > parking_threshold]

# Compute medians and replace NaNs with 0
parking_median = results_df[results_df['parking'] <= parking_threshold].groupby(
    'property_type')['parking'].median().fillna(0)

# Replace outlier values
for idx, row in parking_outliers.iterrows():
    prop_type = row['property_type']
    median_val = parking_median.get(prop_type, 1)
    results_df.at[idx, 'parking'] = median_val

In [92]:
# Fill NaN kitchen values with property-type median, or 0 if still missing
def fill_missing_parking(row):
    if pd.isna(row['parking']):
        return garage_median.get(row['property_type'], 0)
    return row['parking']

# Apply the function
results_df['parking'] = results_df.apply(fill_missing_parking, axis=1)

# Convert the column to integer type
results_df['parking'] = results_df['parking'].round().astype('Int64')

In [93]:
# Verify that null values are not present anymore.
results_df['parking'].isnull().sum()

np.int64(0)

### Backup Water column

In [94]:
# Get the total number of null values in the `backup_water` column
results_df['backup_water'].isnull().sum()

np.int64(12410)

In [95]:
# Get the unique value counts
results_df['backup_water'].value_counts()

backup_water
Yes    757
No     738
Name: count, dtype: int64

We will calculate the mode of the `backup_water` column for each property type to determine the most common backup water availability status associated with each major property category.

In [96]:
# Calculate mode for each property type for the furnished column
backup_water_mode = results_df.groupby('property_type')['backup_water'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)
print(backup_water_mode)

property_type
Apartment / Flat      No
Commercial           Yes
Farm                <NA>
House                 No
Industrial            No
Townhouse             No
Vacant Land           No
Name: backup_water, dtype: object


To maintain data consistency, we will fill in the null values in the `backup_water` column with *"Unknown"*.

In [97]:
# Impute null values with "Unknown"
results_df['backup_water'] = results_df['backup_water'].fillna('Unknown')

In [98]:
# Verify no null values exist
results_df['backup_water'].isnull().sum()

np.int64(0)

### Floor Area column

In [99]:
# Get the total number of null values in the `floor_area column
results_df['floor_area (m²)'].isnull().sum()

np.int64(6234)

In [100]:
# Get the unique value counts
results_df['floor_area (m²)'].value_counts()

floor_area (m²)
300    105
30      93
50      90
100     89
60      78
      ... 
783      1
577      1
635      1
990      1
974      1
Name: count, Length: 764, dtype: int64

To impute the missing values in the `floor_area (m²)` column, we apply a hierarchical group-based median imputation strategy. First, we convert the column to a numeric data type to handle any non-numeric entries safely. We then calculate the median floor area for each combination of property_type and city, as floor space often varies based on both the type of property and its location.

If a specific (property_type, city) pair does not have enough data to calculate a median, we fall back to using the overall median for that property_type. Finally, if the property type itself has no median (e.g., all values are missing), we assign a default value of 0. This approach ensures that the imputed values remain contextually relevant and avoid skewing analyses with arbitrary values.

In [101]:
# Ensure that the floor_area (m²) column is numeric
results_df['floor_area (m²)'] = pd.to_numeric(
    results_df['floor_area (m²)'], errors='coerce').astype('Int64')


# Group by both property_type and city for more specific medians
floor_size_median = results_df.groupby(['property_type', 'city'])['floor_area (m²)'].median()

# Fallback: Overall median by property type
floor_size_fallback = results_df.groupby('property_type')['floor_area (m²)'].median()

def fill_erf_size(row):
    if pd.isna(row['floor_area (m²)']):
        val = floor_size_median.get((row['property_type'], row['city']), np.nan)
        if pd.isna(val):
            val = floor_size_fallback.get(row['property_type'], 0)
        return val
    return row['floor_area (m²)']

results_df['floor_area (m²)'] = results_df.apply(fill_erf_size, axis=1)

In [102]:
# Verify that null values do not exist
results_df['floor_area (m²)'].isnull().sum()

np.int64(0)

### Erf Size column

In [103]:
# Get the total number of null values in the `erf_size` column
results_df['erf_size (m²)'].isnull().sum()

np.int64(10801)

In [104]:
# Get the unique value counts
results_df['erf_size (m²)'].value_counts()

erf_size (m²)
495    68
500    40
250    39
100    34
800    34
       ..
922     1
689     1
877     1
474     1
416     1
Name: count, Length: 716, dtype: int64

To impute the missing values in the `erf_size (m²)` column, we also apply a hierarchical group-based median imputation strategy. First, we convert the column to a numeric data type to handle any non-numeric entries safely. We then calculate the median floor area for each combination of property_type and city, as floor space often varies based on both the type of property and its location.

If a specific (property_type, city) pair does not have enough data to calculate a median, we fall back to using the overall median for that property_type. Finally, if the property type itself has no median (e.g., all values are missing), we assign a default value of 0. This approach ensures that the imputed values remain contextually relevant and avoid skewing analyses with arbitrary values.

In [105]:
# Convert to numeric (coerce errors to NaN)
results_df['erf_size (m²)'] = pd.to_numeric(
    results_df['erf_size (m²)'], errors='coerce')

#Calculate medians
erf_medians = results_df.groupby(['property_type', 'city'])[
    'erf_size (m²)'].median()
erf_medians_fallback = results_df.groupby(
    'property_type')['erf_size (m²)'].median()

# Define the fill function
def fill_erf_size(row):
    if pd.isna(row['erf_size (m²)']):
        val = erf_medians.get((row['property_type'], row['city']), np.nan)
        if pd.isna(val):
            val = erf_medians_fallback.get(row['property_type'], 0)
        return val
    return row['erf_size (m²)']


# Apply the fill function
results_df['erf_size (m²)'] = results_df.apply(fill_erf_size, axis=1)

In [106]:
# Verify that no null values exist
results_df['erf_size (m²)'].isnull().sum()

np.int64(3)

In [107]:
# Visualise the rows that are null in the `erf_size` column
results_df[results_df['erf_size (m²)'].isna()]

Unnamed: 0,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,dining_room,...,garden,pool,garage,erf_size (m²),floor_area (m²),suburb,city,province,price_type,price (ZAR)
64,115727102,Farm,No,Yes,Unknown,4,3,1,1,0,...,0,0,0,,100.0,Olive Hill,Bloemfontein,Free State,Per Month,12450.0
12321,114729501,Farm,Unknown,Yes,Unknown,5,5,1,1,0,...,0,0,0,,100.0,Nelspruit Rural,Nelspruit,Mpumalanga,Per Month,24000.0
13895,115606963,Farm,Unknown,Yes,Unknown,2,1,1,1,0,...,1,0,0,,100.0,Kalkfontein AH,Polokwane,Limpopo,Per Month,5600.0


In [108]:
# Visualise the rows where the property_type equals 'Farm'
results_df[results_df['property_type'] == 'Farm']

Unnamed: 0,listing_number,property_type,furnished,pets_allowed,backup_water,bedrooms,bathrooms,kitchen,lounge,dining_room,...,garden,pool,garage,erf_size (m²),floor_area (m²),suburb,city,province,price_type,price (ZAR)
64,115727102,Farm,No,Yes,Unknown,4,3,1,1,0,...,0,0,0,,100.0,Olive Hill,Bloemfontein,Free State,Per Month,12450.0
12321,114729501,Farm,Unknown,Yes,Unknown,5,5,1,1,0,...,0,0,0,,100.0,Nelspruit Rural,Nelspruit,Mpumalanga,Per Month,24000.0
13895,115606963,Farm,Unknown,Yes,Unknown,2,1,1,1,0,...,1,0,0,,100.0,Kalkfontein AH,Polokwane,Limpopo,Per Month,5600.0


The only records with null values in erf_size (m²) belong to the "Farm" property type. Since these are the only farm listings available, we avoid dropping them. Instead, we impute a conservative placeholder value (e.g., 1000 m²), based on knowledge that farm properties tend to be substantially larger. This allows us to retain these listings in the analysis while maintaining consistency in the dataset.

In [109]:
# Fill in the null values
results_df['erf_size (m²)'].fillna(1000, inplace=True)

In [110]:
# Print a summary of the cleaned dataset
results_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13905 entries, 0 to 13928
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   listing_number   13905 non-null  object 
 1   property_type    13905 non-null  object 
 2   furnished        13905 non-null  object 
 3   pets_allowed     13905 non-null  object 
 4   backup_water     13905 non-null  object 
 5   bedrooms         13905 non-null  Int64  
 6   bathrooms        13905 non-null  Int64  
 7   kitchen          13905 non-null  Int64  
 8   lounge           13905 non-null  Int64  
 9   dining_room      13905 non-null  Int64  
 10  parking          13905 non-null  Int64  
 11  garden           13905 non-null  Int64  
 12  pool             13905 non-null  Int64  
 13  garage           13905 non-null  Int64  
 14  erf_size (m²)    13905 non-null  float64
 15  floor_area (m²)  13905 non-null  float64
 16  suburb           13905 non-null  object 
 17  city             

## 4. Save the transformed data
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Storing our pre-processed data ⚡ |
| :--------------------------- |

---

In [111]:
# Define the file path
save_path = ("../data/processed/cleaned_rental_listings.csv")

# Save to CSV
results_df.to_csv(save_path, index=False)