# <center> 🧽 Data Cleaning Part 1 </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading Data](#2-loading-data)
- [3. Exploratory Data Analysis](#3-exploratory-data-analysis)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

The South African property market is as diverse as its nine provinces, with each capital city offering a distinctive combination of lifestyle, economic activity, and real estate dynamics. In this project, we focus on analyzing property listings across South Africa's provincial capitals to uncover trends, identify investment opportunities, and evaluate affordability across regions.

Real estate decisions, whether to buy, rent, or invest, rely heavily on accurate, data-driven insights. This notebook is part of a broader project aimed at equipping investors, analysts, and policymakers with a clearer view of the current landscape. By leveraging web-scraped data from online property listings, we examine key features such as property types, pricing, location, and amenities.

Through careful data cleaning, exploratory data analysis, and visualization, we aim to:

- Compare rental and sale prices across capital cities
- Identify supply-demand dynamics
- Analyse the affordability and investment potential of different regions
- Explore how property features and economic context influence pricing

The findings from this notebook feed into a Power BI dashboard that allows for dynamic and interactive exploration of property trends across the country. 

This notebook documents each step of the analysis pipeline, from raw data processing to uncovering insights, to ensure transparency, reproducibility, and a foundation for further exploration.

## 2. Loading Data
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [1]:
# Import packages
import re

import numpy as np
import pandas as pd

In [2]:
# Load the rental listings dataset
rental_listings = pd.read_csv("../data/raw/rental_listings_for_capital_cities.csv")

# Display the first few rows of the dataset and its shape
display("Rentals", rental_listings.head(), rental_listings.shape)

'Rentals'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...
3,,,,,,house-to-rent-in-bisho-park-101731048
4,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490


(15324, 6)

## 3. Exploratory Data Analysis
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Exploring the data ⚡ |
| :--------------------------- |

---

In [3]:
# Print a summary of the dataset
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15324 entries, 0 to 15323
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 15323 non-null  object
 1   location              8189 non-null   object
 2   property_title        15323 non-null  object
 3   property_description  15323 non-null  object
 4   property_features     15323 non-null  object
 5   listing_slug          15324 non-null  object
dtypes: object(6)
memory usage: 718.4+ KB


In [4]:
# Print the total number of null values in the dataset
rental_null_values = rental_listings.isnull().sum().sum()
print(f"There are {rental_null_values} null values in the rental dataset")

There are 7139 null values in the rental dataset


In [5]:
# Print the number of null values in each column
rental_listings.isnull().sum()

price                      1
location                7135
property_title             1
property_description       1
property_features          1
listing_slug               0
dtype: int64

In [None]:
# Print the rows that have null values in four columns or more
rental_listings[rental_listings.isnull().sum(axis=1) >= 4]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
3,,,,,,house-to-rent-in-bisho-park-101731048


In [None]:
# Drop the rows that have null values in four columns or more
rental_listings.drop(
    rental_listings[rental_listings.isnull().sum(axis=1) >= 4].index, inplace=True)

In [8]:
# Get the total number of duplicate rows in the dataset
rental_duplicates = rental_listings[rental_listings.duplicated()]
num_rental_duplicates = len(rental_duplicates)
print(f"There are {num_rental_duplicates} duplicates in the rental dataset ")

There are 1303 duplicates in the rental dataset 


In [9]:
# Drop duplicate rows and keep the first occurrence
rental_listings = rental_listings.drop_duplicates().reset_index(drop=True)

#### Property features extraction

The `property_features` column has a section labelled *Points of Interest*, which highlights nearby amenities and infrastructure such as schools, hospitals, shopping centers, public transport stops, and recreational facilities. It provides a quick view of the property's surroundings and helps to assess its suitability for lifestyle preferences, daily commuting, and community accessibility.

We remove this section to reduce noise and ensure consistency, as it contains unstructured text that is not directly useful for aggregation without additional geospatial processing.

In [None]:
# Remove any information following the text 'Points of Interest' in the 'property_features' column
rental_listings["property_features"] = rental_listings["property_features"].str.replace(
    r'Points of Interest.*', '', regex=True
)

# Strip leading and trailing whitespace for easier analysis
rental_listings["property_features"] = rental_listings["property_features"].str.strip()
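
As a quick check of the pattern above, the sketch below applies the same replacement to two made-up feature strings (the sample text is an assumption, not taken from the dataset). It also illustrates that `.` does not match newlines by default, so if the scraped text contains line breaks, an inline `(?s)` flag may be needed to remove everything after the marker.

```python
import pandas as pd

# Hypothetical feature strings; the second one contains a line break
sample = pd.Series([
    "Bedrooms3Bathrooms2Points of InterestSchool 0.5kmHospital 2km",
    "Bedrooms2 Points of InterestSchool 1km\nMall 3km",
])

# Same pattern as in the cell above: the match stops at the first newline
default_clean = sample.str.replace(
    r"Points of Interest.*", "", regex=True
).str.strip()

# (?s) lets '.' match newlines, so the whole tail is removed
dotall_clean = sample.str.replace(
    r"(?s)Points of Interest.*", "", regex=True
).str.strip()

print(default_clean.tolist())
print(dotall_clean.tolist())
```

Whether the real column contains line breaks is not shown in this notebook, so the `(?s)` variant is only worth adopting if leftover amenity text shows up after cleaning.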

The `property_features` column holds a wealth of information about each listing, including the listing number, the type of property, room counts, deposit details, and occupation date. For the purpose of this analysis, we extract only a subset of features: those that are structured enough to quantify reliably and compare across properties.

In [None]:
# We define the patterns of the features we want to extract
patterns = {
    'Listing Number': r'Listing Number(\d+)',
    'Type of Property': r'Type of Property\s*(House|Apartment / Flat|Townhouse|Commercial|Industrial|Vacant Land|Farm)',
    'Pets Allowed': r'Pets Allowed(Yes|No)',
    'Furnished': r'Furnished(Yes|No)',
    'Bedrooms': r'Bedrooms?\s*(\d+)',
    'Bathrooms': r'Bathrooms?\s*(\d+)',
    'Kitchen': r'Kitchens?\s*(\d+)',
    'Lounge': r'Lounges?\s*(\d+)',
    'Dining Room': r'Dining Room(\d+)',
    'Parking': r'Parking(\d+)',
    'Backup Water': r'Backup Water(Yes|No)',
    'Erf Size (m²)': r'Erf Size\s*(\d+)\s*m²',
    'Floor Area (m²)': r'Floor Area\s*(\d+)\s*m²',
    'Garden': r'Gardens?\s*(\d+)',
    'Pool': r'Pools?\s*(\d+)',
    'Garage': r'Garages?\s*(\d+)'
}



# Function to extract features from text
def extract_features(text):
    features = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        # Store the captured value, or None when the pattern is absent
        features[field] = match.group(1).strip() if match else None
    return features
    
# Apply extraction row-wise
extracted_df = rental_listings['property_features'].apply(extract_features).apply(pd.Series)

# Combine original df with the extracted columns
results_df = pd.concat([rental_listings, extracted_df], axis=1)

# View result
results_df.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,Listing Number,Type of Property,Pets Allowed,Furnished,...,Kitchen,Lounge,Dining Room,Parking,Backup Water,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,115295115,House,No,No,...,1.0,1.0,1.0,1.0,No,,,,,
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,115332086,House,No,No,...,1.0,1.0,,1.0,,,,1.0,,
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,115210602,House,Yes,No,...,1.0,1.0,,1.0,,,,1.0,,
3,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,115555490,House,Yes,No,...,1.0,,,3.0,,,,,,
4,R 5 500Per Month,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,115525482,House,,,...,,,,,,,,,,
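
To make the extraction step easier to follow, here is a minimal sketch that runs a reduced set of patterns against a single made-up feature string. The names `sample_patterns`, `extract_sample_features`, and `sample_text` are illustrative assumptions and are not part of the notebook.

```python
import re

# Reduced, illustrative subset of the patterns dictionary
sample_patterns = {
    "Listing Number": r"Listing Number(\d+)",
    "Type of Property": r"Type of Property\s*(House|Apartment / Flat|Townhouse)",
    "Bedrooms": r"Bedrooms?\s*(\d+)",
    "Pool": r"Pools?\s*(\d+)",
}


def extract_sample_features(text):
    """Return the first captured group per field, or None when absent."""
    features = {}
    for field, pattern in sample_patterns.items():
        match = re.search(pattern, text)
        features[field] = match.group(1) if match else None
    return features


# Made-up feature string in the concatenated style of the scraped column
sample_text = (
    "Property OverviewListing Number123456789"
    "Type of PropertyHouseBedrooms3Bathrooms2"
)
print(extract_sample_features(sample_text))
```

Note that every extracted value is a string (or `None`), which is why numeric columns such as `Bedrooms` need an explicit conversion later on.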


In [None]:
# Print the number of null values in the `Bedrooms` column
results_df["Bedrooms"].isnull().sum()

np.int64(11068)

The `property_title` column often contains structured phrases that include the number of bedrooms for residential properties such as apartments, houses, and townhouses. This consistency allows us to extract bedroom information directly from the title using regular expressions, which helps fill in missing values in the `Bedrooms` column.

In [13]:
def extract_bedrooms_from_property_title(title):
    if pd.isna(title):
        return None
    match = re.search(r'(\d+)[-\s]?bedroom', title.lower())
    return int(match.group(1)) if match else None

results_df['Bedrooms'] = results_df['Bedrooms'].fillna(
    results_df.apply(lambda row: extract_bedrooms_from_property_title(row['property_title']) if pd.isna(row['Bedrooms']) else row['Bedrooms'], axis=1))

# Optional: convert to integer if needed
results_df['Bedrooms'] = pd.to_numeric(results_df['Bedrooms'], errors='coerce').astype('Int64')
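
The same fill can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant on a small made-up frame (the `demo` data is an assumption); it uses `Series.str.extract` with the same regular expression and only fills rows where `Bedrooms` is missing.

```python
import re

import pandas as pd

# Made-up rows: Bedrooms is missing where the feature text had no match
demo = pd.DataFrame({
    "property_title": [
        "3 Bedroom House to Rent in Bisho Park",
        "House to Rent in Bisho Park",
        "2-Bedroom Apartment to Rent in Hilton",
    ],
    "Bedrooms": ["3", None, None],
})

# Pull the bedroom count out of the title for every row
from_title = demo["property_title"].str.extract(
    r"(\d+)[-\s]?bedroom", flags=re.IGNORECASE, expand=False
)

# Keep existing values, fill only the gaps, then use a nullable integer dtype
demo["Bedrooms"] = pd.to_numeric(
    demo["Bedrooms"].fillna(from_title), errors="coerce"
).astype("Int64")

print(demo["Bedrooms"].tolist())
```

Titles without a bedroom count, like the second row, stay missing and are handled by the later imputation step.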

In [None]:
# Print the number of null values in the `Bedrooms` column after the function has been applied
results_df["Bedrooms"].isnull().sum()

np.int64(8521)

####  Suburb, city and province extraction

The `property_title` column follows a structure that includes the suburb where the property is located. We can use this to extract the suburb for each listing. 

In [15]:
# Extract the suburb from the property_title column
results_df['suburb'] = results_df['property_title'].str.extract(
    r'to Rent in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
results_df['suburb'] = (
    results_df['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Drop the property_title column as we have extracted all the information we need
results_df.drop(columns=['property_title'], inplace=True)

# Print the first five rows of the table to see the results
results_df.head()

Unnamed: 0,price,location,property_description,property_features,listing_slug,Listing Number,Type of Property,Pets Allowed,Furnished,Bedrooms,...,Lounge,Dining Room,Parking,Backup Water,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage,suburb
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,115295115,House,No,No,3.0,...,1.0,1.0,1.0,No,,,,,,Tyu Tyu North
1,R 7 000Per Month,"Bhisho, Bhisho Central",How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,115332086,House,No,No,3.0,...,1.0,,1.0,,,,1.0,,,Bhisho Central
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,115210602,House,Yes,No,2.0,...,1.0,,1.0,,,,1.0,,,Tyu Tyu North
3,R 7 500Per Month,"Bhisho, Bisho Park",BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,115555490,House,Yes,No,3.0,...,,,3.0,,,,,,,Bisho Park
4,R 5 500Per Month,"Bhisho, Bisho Park",This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,115525482,House,,,,...,,,,,,,,,,Bisho Park


We then load a separate CSV file into **city_province**. It maps each capital city to its province and is used to validate the extracted cities and add province information to the main dataset.

In [None]:
# Load the CSV that lists the capital cities with their respective provinces
city_province = pd.read_csv("../data/raw/south_africa_provinces_capitals.csv")

# Create a list of unique city names to match against
city_list = city_province['city'].dropna().str.strip().unique()

The `location` column has inconsistent formats: sometimes it lists both the suburb and the city, and other times it provides a full address. To address this, we use *city_list* to extract the city by matching each location against the known city names. If no match is found, the value is set to **None**.

In [17]:
def extract_city_from_location(location, city_list):
    if pd.isna(location):
        return None
    
    
    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None
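
A short usage sketch of this matching logic follows. The city list and location strings are made up for illustration (the real list comes from the CSV), and `find_city` is a stand-in name that mirrors the function above.

```python
import pandas as pd

# Hypothetical list of capital cities
demo_cities = ["Bhisho", "Cape Town", "Bloemfontein"]


def find_city(location, cities):
    """Return the first city whose name appears in the location text."""
    if pd.isna(location):
        return None
    location_lower = location.lower()
    for city in cities:
        if city.lower() in location_lower:
            return city
    return None


demo_locations = [
    "Bhisho, Bisho Park",
    "12 Example Street, Clifton, Cape Town",
    "Unknown Suburb",
    None,
]
matched = [find_city(loc, demo_cities) for loc in demo_locations]
print(matched)
```

Because this is plain substring matching, a city name that happens to appear inside a street or suburb name would also match, so the result is worth spot-checking on real data.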

In [18]:
# Apply the extract_city_from_location function to extract the city from the location column
results_df['city'] = results_df['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

# Drop the location column as it is redundant
results_df.drop(columns=["location"], inplace=True)

The `city` column contains null values because some entries in the `location` column are empty or incomplete. To address this, we create a mapping of known suburbs to their matching cities and use it to fill in the missing `city` values.

In [19]:
# Create a mapping of suburb to known cities
suburb_to_city_map = results_df[results_df['city'].notna()].drop_duplicates(subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
results_df['city'] = results_df.apply(lambda row: suburb_to_city_map.get(row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis= 1)   
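
As an alternative to the row-wise `apply`, the fill can be expressed with `Series.map` and `fillna`. This is a sketch on made-up data, and it assumes each suburb belongs to a single city: when a suburb name occurs in more than one city, this version keeps the first city seen, whereas the dictionary built above keeps the last.

```python
import pandas as pd

demo = pd.DataFrame({
    "suburb": ["Clifton", "Clifton", "Hilton", "Nowhere"],
    "city": ["Cape Town", None, "Bloemfontein", None],
})

# Build suburb -> city from rows where the city is known
known = demo.dropna(subset=["city"]).drop_duplicates(subset="suburb")
suburb_to_city = known.set_index("suburb")["city"]

# Fill only the missing cities; unmapped suburbs stay null
demo["city"] = demo["city"].fillna(demo["suburb"].map(suburb_to_city))
print(demo)
```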

In [None]:
# Print out the number of null values within the `city` column
results_df['city'].isnull().sum()

np.int64(91)

There are 91 rows with a null `city`, which account for about 0.65% of the 14,020 listings remaining at this point. We remove these rows to maintain data quality given their small proportion.

In [None]:
# Drop the city rows with null values
results_df = results_df.dropna(subset=['city'])
results_df['city'].isnull().sum()

We then merge the dataset with the **city_province** DataFrame to assign the matching province to each city.

In [23]:
# Merge the city with its matching province
results_df = results_df.merge(city_province, on='city', how='left')
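
A left merge silently duplicates rows if the lookup table lists a city twice, and silently leaves nulls if a city is missing from it. The sketch below, on made-up frames, shows how the `validate` and `indicator` parameters of `DataFrame.merge` can guard against both; the contents of `lookup` are assumptions.

```python
import pandas as pd

listings = pd.DataFrame({"city": ["Cape Town", "Bhisho", "Cape Town"]})

# Hypothetical lookup table of cities and provinces
lookup = pd.DataFrame({
    "city": ["Cape Town", "Bhisho"],
    "province": ["Western Cape", "Eastern Cape"],
})

merged = listings.merge(
    lookup,
    on="city",
    how="left",
    validate="many_to_one",  # raises MergeError if a city repeats in lookup
    indicator=True,          # adds a _merge column showing where rows matched
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} listings have no matching province")
```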

The `listing_slug` column contains a URL-friendly string summarising key details about each listing, such as the number of bedrooms, property type, suburb, and a unique listing ID. Because this information is already available in other columns, we drop `listing_slug` to avoid redundancy.

In [24]:
# Drop the listing slug column
results_df.drop(columns=['listing_slug'], inplace=True)

#### Price column

The `price` column currently combines the price amount with its associated pricing type (e.g., R 7 000Per Month), which complicates aggregation and analysis. To resolve this, we separate the pricing type into a new `price_type` column and extract the numeric value into a new `price (in rands)` column, removing the currency symbol in the process. Both columns are added to the existing **results_df** DataFrame.

In [25]:
# Extract the pricing option after the price amount
results_df['price_type'] = results_df['price'].str.extract(
    r'(Per\s*Month|Per\s*Day|m2|m²|Perm²)', flags=re.IGNORECASE, expand=False
)


# Normalize price_type text (remove spaces, lower case) then map to standard form
results_df['price_type'] = results_df['price_type'].str.replace(
    r'\s+', '', regex=True).str.lower()
results_df['price_type'] = results_df['price_type'].replace({
    'permonth': 'Per Month',
    'perday': 'Per Day',
    'm2': 'Per m²',
    'm²': 'Per m²',
    'perm²': 'Per m²'
})

# Remove spaces inside the price string first
results_df['price_clean'] = results_df['price'].str.replace(
    r'\s+', '', regex=True)

# Now extract the numeric part (digits only, continuous)
results_df['price (in rands)'] = results_df['price_clean'].str.extract(
    r'R?(\d+)', expand=False)

# Convert the price(in rands) column to numeric
results_df['price (in rands)'] = pd.to_numeric(
    results_df['price (in rands)'], errors='coerce')

# Drop the helper column now that the numeric price has been extracted
results_df.drop(columns=['price_clean'], inplace=True)

# Replace blank or whitespace-only price_type entries with missing values
results_df['price_type'] = results_df['price_type'].replace(
    r'^\s*$', pd.NA, regex=True)

# Preview results
print(results_df[['price', 'price_type', 'price (in rands)']].head())

              price price_type  price (in rands)
0  R 7 000Per Month  Per Month            7000.0
1  R 7 000Per Month  Per Month            7000.0
2  R 5 300Per Month  Per Month            5300.0
3  R 7 500Per Month  Per Month            7500.0
4  R 5 500Per Month  Per Month            5500.0
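
To see how the two extraction steps behave on different inputs, the sketch below runs a simplified version on a handful of sample strings modelled on the formats in this notebook. The `prices` values are illustrative, and the pattern is reduced to three alternatives.

```python
import re

import pandas as pd

# Sample strings modelled on the formats seen in the price column
prices = pd.Series(["R 7 000Per Month", "R 1 250Per Day", "POAPer m²", "POA"])

# Pricing type: the text that follows the amount
price_type = prices.str.extract(
    r"(Per\s*Month|Per\s*Day|Per\s*m²)", flags=re.IGNORECASE, expand=False
)

# Amount: strip all whitespace, then take the first run of digits
amount = pd.to_numeric(
    prices.str.replace(r"\s+", "", regex=True).str.extract(
        r"R?(\d+)", expand=False
    ),
    errors="coerce",
)

print(pd.DataFrame({"price": prices, "price_type": price_type, "amount": amount}))
```

POA listings have no digits, so their amount comes out as missing; the superscript in `m²` is not matched by `\d`, which is why it is not mistaken for a price.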


In [26]:
# Count the listings for each value in the price_type column
results_df['price_type'].value_counts()

price_type
Per Month    12826
Per m²         948
Per Day        143
Name: count, dtype: int64

In [27]:
# Get the number of null values in the price_type column
results_df['price_type'].isnull().sum()

np.int64(12)

In [28]:
# Display the rows where price_type is null
results_df[results_df['price_type'].isna()]

Unnamed: 0,price,property_description,property_features,Listing Number,Type of Property,Pets Allowed,Furnished,Bedrooms,Bathrooms,Kitchen,...,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage,suburb,city,province,price_type,price (in rands)
357,POA,"Unit 1 within 29 Grey Street, Hilton - FS, Blo...",Property OverviewListing Number114774738Type o...,114774738,Industrial,,No,,,,...,,,,,,Hilton,Bloemfontein,Free State,,
3728,POA,"Salty ocean breezes, golden sunsets, breathtak...",Property OverviewListing Number115605901Type o...,115605901,House,No,Yes,4.0,6.0,2.0,...,,,1.0,1.0,1.0,Bantry Bay,Cape Town,Western Cape,,
3760,POA,Peering over the edge of Africa from its cliff...,Property OverviewListing Number115594664Type o...,115594664,House,No,Yes,5.0,7.0,2.0,...,,,,1.0,1.0,Clifton,Cape Town,Western Cape,,
3771,POA,Drawing inspiration from the whitewashed villa...,Property OverviewListing Number115613951Type o...,115613951,House,,Yes,5.0,5.0,2.0,...,,,,1.0,1.0,Fresnaye,Cape Town,Western Cape,,
3837,POA,"With its gaze fixed to the horizon, this beach...",Property OverviewListing Number115606187Type o...,115606187,House,No,Yes,4.0,5.0,1.0,...,,,,1.0,2.0,Clifton,Cape Town,Western Cape,,
4180,POA,An exceptional and very glamorous Llandudno vi...,Property OverviewListing Number108778518Type o...,108778518,House,,Yes,5.0,5.0,1.0,...,991.0,862.0,,,,Llandudno,Cape Town,Western Cape,,
4587,POA,*Please note that the daily rate ranges from R...,Property OverviewListing Number115118688Type o...,115118688,House,No,Yes,5.0,4.0,1.0,...,530.0,,,,,Camps Bay,Cape Town,Western Cape,,
4610,POA,Clifton beach bungalow with entertainment area...,Property OverviewListing Number115153896Type o...,115153896,House,No,Yes,5.0,5.0,,...,,463.0,,,,Clifton,Cape Town,Western Cape,,
4896,POA,A beautifully crafted furnished home with the ...,Property OverviewListing Number110813037Type o...,110813037,House,,Yes,8.0,8.0,1.0,...,,605.0,,,,Llandudno,Cape Town,Western Cape,,
4899,POA,"Welcome to this Constantia homestead, steeped ...",Property OverviewListing Number113953599Type o...,113953599,House,,Yes,7.0,7.0,,...,,,,,,Constantia,Cape Town,Western Cape,,


Next, we identify the listings whose price is POA (price on application) and separate those that still carry a price type from those that do not.

In [29]:
# Count the rows whose 'price' value contains 'POA'
poa_mask = results_df['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

# Print the total number of rows
print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 217 listings with 'POA' in the price column.


In [30]:
# Identify POA rows with valid price_type
poa_mask = results_df['price'].str.lower().str.startswith('poa')
poa_with_type_mask = poa_mask & results_df['price_type'].notna() & (results_df['price_type'] != '')

# Calculate averages for the suburb, Type of Property and price_type
avg_prices = results_df[~poa_mask & results_df['price (in rands)'].notna()] \
    .groupby(['suburb', 'Type of Property', 'price_type'])['price (in rands)'] \
    .apply(lambda x: pd.to_numeric(x, errors='coerce').mean()) \
    .reset_index().rename(columns={'price (in rands)': 'price_avg'})
    
# Merge the averages into the original DataFrame
results_df = results_df.merge(avg_prices, on=['suburb', 'Type of Property', 'price_type'], how='left')

# Replace POA prices that have a price type with the matching group average
results_df.loc[poa_with_type_mask,
'price (in rands)'] = results_df.loc[poa_with_type_mask, 'price_avg']

# Drop the POA rows without a price type
results_df = results_df[~(poa_mask & ~poa_with_type_mask)]

# Drop the helper column
results_df.drop(columns=['price_avg'], inplace=True)
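
The group-average imputation above can be illustrated on a small made-up frame. This sketch uses `groupby(...).transform("mean")` instead of a merge, which is a different route to the same idea; the `demo` values are assumptions.

```python
import pandas as pd

demo = pd.DataFrame({
    "suburb": ["Clifton", "Clifton", "Clifton", "Hilton"],
    "Type of Property": ["House", "House", "House", "Industrial"],
    "price_type": ["Per Month", "Per Month", "Per Month", "Per Month"],
    "price (in rands)": [80000.0, 100000.0, None, None],
})

group_cols = ["suburb", "Type of Property", "price_type"]

# Mean of the known prices within each group, broadcast back to every row
group_mean = demo.groupby(group_cols)["price (in rands)"].transform("mean")

# Fill only the missing prices; groups with no known price remain null
demo["price (in rands)"] = demo["price (in rands)"].fillna(group_mean)
print(demo)
```

The last row stays null because its group has no priced listing to average, which mirrors why some POA rows still have a missing price after this step.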

In [31]:
print("Remaining 'POA':", results_df['price'].str.lower().eq('poa').sum())
print("Remaining nulls in 'price (in rands)':",
      results_df['price (in rands)'].isna().sum())

Remaining 'POA': 0
Remaining nulls in 'price (in rands)': 12


In [32]:
# Get the rows where price (in rands) is null
results_df[results_df['price (in rands)'].isna()]

Unnamed: 0,price,property_description,property_features,Listing Number,Type of Property,Pets Allowed,Furnished,Bedrooms,Bathrooms,Kitchen,...,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage,suburb,city,province,price_type,price (in rands)
4246,POAPer Day,Positioned in the prestigious Fresnaye neighbo...,Property OverviewListing Number115374518Type o...,115374518,Apartment / Flat,,,4.0,4.0,1.0,...,,,,1.0,,Fresnaye,Cape Town,Western Cape,Per Day,
4586,POAPer Day,Escape to Paradise...Welcome to a world of enc...,Property OverviewListing Number112682529Type o...,112682529,House,,Yes,7.0,7.0,,...,,,,,,Zwaanswyk,Cape Town,Western Cape,Per Day,
4836,POAPer Day,This magnificent penthouse located in Barley B...,Property OverviewListing Number111244618Type o...,111244618,Apartment / Flat,,,3.0,3.0,,...,,286.0,,1.0,,Bakoven,Cape Town,Western Cape,Per Day,
4863,POAPer Day,Discover Luxury Living in the Heart of Cape To...,Property OverviewListing Number114968205Type o...,114968205,Apartment / Flat,,,4.0,4.0,1.0,...,,,,1.0,,Fresnaye,Cape Town,Western Cape,Per Day,
4991,POAPer Day,*Available 01 May 2025 - 31 August 2025*If it ...,Property OverviewListing Number111859212Type o...,111859212,Apartment / Flat,,Yes,3.0,4.0,1.0,...,,,,1.0,,Bantry Bay,Cape Town,Western Cape,Per Day,
7450,POAPer m²,1.\tOffices with large windows2.\tKitchenette3...,Property OverviewListing Number115523149Type o...,115523149,Commercial,,,,,,...,,235.0,,,,Heriotdale,Johannesburg,Gauteng,Per m²,
8317,POAPer m²,1.\tAmple Parking2.\tOffices with large window...,Property OverviewListing Number115418255Type o...,115418255,Commercial,,,,,,...,,260.0,,,,Heriotdale,Johannesburg,Gauteng,Per m²,
8706,POAPer Month,Discover this spacious 500 square meter proper...,Property OverviewListing Number114459245Type o...,114459245,Industrial,,No,,,,...,500.0,500.0,,,,Westgate,Johannesburg,Gauteng,Per Month,
8758,POAPer m²,1.\tOffices2.\tAmple Kitchenette3.\tAmple Ablu...,Property OverviewListing Number115517306Type o...,115517306,Commercial,,,,,,...,,225.0,,,,Heriotdale,Johannesburg,Gauteng,Per m²,
12308,POAPer Month,"This rarely available 3,788sqm industrial ware...",Property OverviewListing Number113997397Type o...,113997397,Industrial,,No,,,,...,,,,,,Nelsville,Nelspruit,Mpumalanga,Per Month,


In [33]:
# Drop rows in the price (in rands) column that are null
results_df = results_df.dropna(subset=['price (in rands)'])

In [34]:
# Drop the price column as we do not need it anymore
results_df.drop(columns=['price'], inplace=True)

### Pets Allowed

In [35]:
results_df['Pets Allowed'].isnull().sum()

np.int64(10304)

In [36]:
results_df['Pets Allowed'].value_counts()

Pets Allowed
No     2598
Yes    1003
Name: count, dtype: int64

In [37]:
# Calculate mode for each property type
pets_mode = results_df.groupby('Type of Property')['Pets Allowed'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)
print(pets_mode)

Type of Property
Apartment / Flat     No
Commercial           No
Farm                Yes
House               Yes
Industrial           No
Townhouse            No
Vacant Land          No
Name: Pets Allowed, dtype: object


In [38]:
default_pets_value = {
    'Apartment / Flat': 'No', 
    'Commercial': 'No',
    'Farm': 'Yes',
    'House': 'Yes', 
    'Industrial': 'No', 
    'Townhouse': 'No', 
    'Vacant Land': 'No'
}

results_df['Pets Allowed'] = results_df.apply(
    lambda row: default_pets_value[row['Type of Property']]
    if pd.isna(row['Pets Allowed']) and row['Type of Property'] in default_pets_value
    else row['Pets Allowed'],
    axis=1
)
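
The same default-based fill can be written without `apply` by mapping the property type to its default and passing the result to `fillna`. The sketch below uses a made-up frame and a shortened defaults dictionary for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "Type of Property": ["House", "Apartment / Flat", "House", "Farm"],
    "Pets Allowed": [None, None, "No", None],
})

# Shortened, illustrative version of the defaults dictionary
defaults = {"House": "Yes", "Apartment / Flat": "No", "Farm": "Yes"}

# Existing answers are kept; only missing values take the type default
demo["Pets Allowed"] = demo["Pets Allowed"].fillna(
    demo["Type of Property"].map(defaults)
)
print(demo["Pets Allowed"].tolist())
```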

In [39]:
print(results_df['Pets Allowed'].isna().sum())

0


### Furnished

In [40]:
results_df['Furnished'].isnull().sum()

np.int64(4635)

In [41]:
results_df['Furnished'].value_counts()

Furnished
No     7962
Yes    1308
Name: count, dtype: int64

In [42]:
# Calculate mode for each property type for the furnished column
furnished_mode = results_df.groupby('Type of Property')['Furnished'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)
print(furnished_mode)

Type of Property
Apartment / Flat    No
Commercial          No
Farm                No
House               No
Industrial          No
Townhouse           No
Vacant Land         No
Name: Furnished, dtype: object


In [43]:
# Fill the null values in the Furnished column with "Unknown"
results_df['Furnished'] = results_df['Furnished'].fillna("Unknown")

In [44]:
print(results_df['Furnished'].isna().sum())

0


### Bedrooms

In [45]:
results_df['Bedrooms'].isnull().sum()

np.int64(8449)

In [46]:
results_df['Bedrooms'].value_counts()

Bedrooms
2     1825
1     1446
3     1178
5      403
4      347
0      177
6       45
7       17
8        8
9        4
11       2
13       2
12       1
10       1
Name: count, dtype: Int64

In [47]:
# Calculate the mode for each property type for the Bedrooms column
bedrooms_mode = results_df.groupby('Type of Property')['Bedrooms'].agg(
    lambda x: x.mode().iloc[0] if not x.mode().empty else pd.NA
)
print(bedrooms_mode)

Type of Property
Apartment / Flat       2
Commercial             0
Farm                   2
House                  3
Industrial             2
Townhouse              2
Vacant Land         <NA>
Name: Bedrooms, dtype: Int64


In [48]:
bedroom_medians = results_df.groupby('Type of Property')['Bedrooms'].median()

def fill_bedrooms(row):
    if pd.isna(row['Bedrooms']):
        median = bedroom_medians.get(row['Type of Property'], np.nan)
        return median if pd.notna(median) else 0
    return row['Bedrooms']

results_df['Bedrooms'] = results_df.apply(fill_bedrooms, axis=1)
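
For reference, the median fill with a zero fallback can be sketched in vectorized form as below. The frame is made up, and plain floats are used to keep the example simple.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Type of Property": ["House", "House", "House", "House", "Vacant Land"],
    "Bedrooms": [2.0, 3.0, 3.0, np.nan, np.nan],
})

# Median per property type, aligned to each row
type_median = demo.groupby("Type of Property")["Bedrooms"].transform("median")

# Fill from the group median first, then fall back to 0 where no median exists
demo["Bedrooms"] = demo["Bedrooms"].fillna(type_median).fillna(0)
print(demo["Bedrooms"].tolist())
```

The fallback to 0 applies to types such as vacant land, where no listing reports a bedroom count at all.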

In [49]:
results_df['Bedrooms'].isnull().sum()

np.int64(0)

### Bathrooms

In [50]:
results_df['Bathrooms'].isnull().sum()

np.int64(6113)

In [51]:
results_df['Bathrooms'].value_counts()

Bathrooms
1      3937
2      2662
3       529
4       347
6        99
5        90
8        41
7        26
10       21
0        13
12        6
15        5
9         4
20        3
14        2
16        2
11        1
13        1
30        1
200       1
160       1
Name: count, dtype: int64

In [52]:
# Ensure 'Bathrooms' is numeric
results_df['Bathrooms'] = pd.to_numeric(
    results_df['Bathrooms'], errors='coerce')

# Filter listings with 10 or more bathrooms (Bathrooms > 9)
bathroom_outliers = results_df[results_df['Bathrooms'] > 9]

# Group by Type of Property and count
bathroom_summary = bathroom_outliers.groupby(
    'Type of Property')['Bathrooms'].count().reset_index()

# Rename column for clarity
bathroom_summary.columns = ['Type of Property', 'Count with >10 Bathrooms']

# Sort by count descending
bathroom_summary = bathroom_summary.sort_values(
    by='Count with >10 Bathrooms', ascending=False)

print(bathroom_summary)

  Type of Property  Count with >10 Bathrooms
0       Commercial                        21
2       Industrial                        19
1            House                         4


In [53]:
threshold = 10
extreme_bathrooms = results_df[results_df['Bathrooms'] > threshold]

In [54]:
# Set the threshold for outlier bathroom counts
threshold = 9

outlier_bathrooms = results_df[results_df['Bathrooms'] > threshold]

# Median bathrooms per property type, computed from non-outlier rows only
bathroom_median = results_df[results_df['Bathrooms'] <= threshold].groupby('Type of Property')['Bathrooms'].median()

bathroom_median = bathroom_median.fillna(0)

# Replace each outlier with the median for its property type (0 if unknown)
for idx, row in outlier_bathrooms.iterrows():
    prop_type = row['Type of Property']
    median_val = bathroom_median.get(prop_type, 0)
    results_df.at[idx, 'Bathrooms'] = median_val
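
The outlier replacement can also be done without a loop. The following sketch, on made-up data, computes the per-type median from non-outlier rows and swaps it in with `Series.mask`; the threshold of 9 matches the cell above.

```python
import pandas as pd

demo = pd.DataFrame({
    "Type of Property": ["House", "House", "House", "Commercial", "Commercial"],
    "Bathrooms": [1.0, 2.0, 200.0, 2.0, 30.0],
})
threshold = 9

is_outlier = demo["Bathrooms"] > threshold

# Median per type computed from non-outlier rows only
typical = demo.loc[~is_outlier].groupby("Type of Property")["Bathrooms"].median()

# Replace outliers with their type's median, defaulting to 0 for unseen types
replacement = demo["Type of Property"].map(typical).fillna(0)
demo["Bathrooms"] = demo["Bathrooms"].mask(is_outlier, replacement)
print(demo["Bathrooms"].tolist())
```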

### Kitchen

In [55]:
results_df['Kitchen'].isnull().sum()

np.int64(7387)

In [56]:
# Ensure that the Kitchen column is numeric
results_df['Kitchen'] = pd.to_numeric(
    results_df['Kitchen'], errors='coerce')


# Group by property type and calculate median
median_bathrooms = results_df.groupby('Type of Property')[
    'Kitchen'].median().reset_index()

# Rename columns for clarity
median_bathrooms.columns = ['Type of Property', 'Median Kitchen']

# Sort by median descending (optional)
median_bathrooms = median_bathrooms.sort_values(
    by='Median Kitchen', ascending=False)

print(median_bathrooms)

   Type of Property  Median Kitchen
0  Apartment / Flat             1.0
1        Commercial             1.0
2              Farm             1.0
3             House             1.0
4        Industrial             1.0
5         Townhouse             1.0
6       Vacant Land             NaN


In [57]:
results_df['Kitchen'].value_counts()

Kitchen
1.0     6116
2.0      304
4.0       44
3.0       40
5.0       10
6.0        2
21.0       1
10.0       1
Name: count, dtype: int64

In [58]:
kitchen_threshold = 6

# Identify outliers
kitchen_outliers = results_df[results_df['Kitchen'] > kitchen_threshold]

# Compute medians and replace NaNs with 0
kitchen_median = results_df[results_df['Kitchen'] <= kitchen_threshold] \
    .groupby('Type of Property')['Kitchen'].median().fillna(0)

# Replace outlier values
for idx, row in kitchen_outliers.iterrows():
    prop_type = row['Type of Property']
    median_val = kitchen_median.get(prop_type, 1)
    results_df.at[idx, 'Kitchen'] = median_val

In [59]:
# Fill NaN kitchen values with property-type median, or 0 if still missing
def fill_missing_kitchen(row):
    if pd.isna(row['Kitchen']):
        return kitchen_median.get(row['Type of Property'], 0)
    return row['Kitchen']


results_df['Kitchen'] = results_df.apply(fill_missing_kitchen, axis=1)

In [60]:
results_df['Kitchen'].isnull().sum()

np.int64(0)
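
The Kitchen cells apply two steps that recur for several later columns: replace values above a threshold, then fill missing values, both using the property-type median. As a sketch only, the two steps could be wrapped in one function. `clean_count_column` is a hypothetical name, the toy data is made up, and unlike the cells above it falls back to 0 in every case where no median is available.

```python
import pandas as pd


def clean_count_column(df, column, threshold, group_col='Type of Property'):
    """Hypothetical helper: replace outliers and NaNs with the group median."""
    values = pd.to_numeric(df[column], errors='coerce')

    # Blank out values above the threshold so they are treated like NaN
    in_range = values.where(values <= threshold)

    # Medians come from in-range rows only; groups with no data fall back to 0
    medians = in_range.groupby(df[group_col]).median().fillna(0)

    # Look up each row's group median and use it wherever a value is missing
    row_median = df[group_col].map(medians).fillna(0)
    return in_range.fillna(row_median)


# Toy stand-in for results_df (values are illustrative only)
toy_df = pd.DataFrame({
    'Type of Property': ['House', 'House', 'House', 'Vacant Land', 'House'],
    'Kitchen': [1, 2, None, None, 21],
})
toy_df['Kitchen'] = clean_count_column(toy_df, 'Kitchen', threshold=6)

print(toy_df)
```

Keeping the column name and threshold as parameters means each column is cleaned with its own medians, which removes the need to copy and rename a fill function per column.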

### Lounge

In [61]:
results_df['Lounge'].isnull().sum()

np.int64(12299)

In [62]:
results_df['Lounge'].value_counts()

Lounge
1    1575
2      29
3       2
Name: count, dtype: int64

In [63]:
# Ensure that the Lounge column is numeric
results_df['Lounge'] = pd.to_numeric(
    results_df['Lounge'], errors='coerce')


# Group by property type and calculate median
median_lounge = results_df.groupby('Type of Property')[
    'Lounge'].median().reset_index()

# Rename columns for clarity
median_lounge.columns = ['Type of Property', 'Median Lounge']

# Sort by median descending (optional)
median_lounge = median_lounge.sort_values(
    by='Median Lounge', ascending=False)

print(median_lounge)


In [64]:
# Compute median lounges per property type
lounge_median = results_df.groupby('Type of Property')['Lounge'].median()

# Fill NaNs in median with 0
median_lounge = lounge_median.fillna(0)

# Function using the correct filled median


def fill_lounge(row):
    if pd.isna(row['Lounge']):
        return median_lounge.get(row['Type of Property'], 0)
    return row['Lounge']


# Apply the fix
results_df['Lounge'] = results_df.apply(fill_lounge, axis=1)

In [65]:
results_df['Lounge'].isnull().sum()

np.int64(0)
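
Only 1,606 listings report a Lounge value, against 12,299 missing, so most of the column is imputed after this step. One option, sketched here on made-up data, is to record which rows were filled before filling them. The `Lounge Imputed` column name is an assumption introduced for illustration.

```python
import pandas as pd

# Toy stand-in for results_df (values are illustrative only)
toy_df = pd.DataFrame({
    'Type of Property': ['House', 'House', 'Townhouse', 'Townhouse'],
    'Lounge': [1.0, None, None, 2.0],
})

# Record which rows were originally missing BEFORE any filling happens
toy_df['Lounge Imputed'] = toy_df['Lounge'].isna()

# Fill with the per-type median; types with no data at all fall back to 0
type_median = toy_df.groupby('Type of Property')['Lounge'].transform('median')
toy_df['Lounge'] = toy_df['Lounge'].fillna(type_median).fillna(0)

# Share of rows whose Lounge value was imputed rather than observed
imputed_share = toy_df['Lounge Imputed'].mean()
print(imputed_share)
```

The flag lets later analysis or the dashboard separate observed values from filled ones.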

### Dining Room

In [66]:
results_df['Dining Room'].isnull().sum()

np.int64(13103)

In [67]:
results_df['Dining Room'].value_counts()

Dining Room
1    795
2      6
3      1
Name: count, dtype: int64

In [68]:
# Ensure that the Dining Room column is numeric
results_df['Dining Room'] = pd.to_numeric(
    results_df['Dining Room'], errors='coerce')


# Group by property type and calculate median
median_dining_room = results_df.groupby('Type of Property')[
    'Dining Room'].median().reset_index()

# Rename columns for clarity
median_dining_room.columns = ['Type of Property', 'Median Dining Room']

# Sort by median descending (optional)
median_dining_room = median_dining_room.sort_values(
    by='Median Dining Room', ascending=False)

print(median_dining_room)


In [69]:
# Compute median dining rooms per property type
dining_room_median = results_df.groupby('Type of Property')['Dining Room'].median()

# Fill NaNs in median with 0
median_dining_room = dining_room_median.fillna(0)

# Function using the correct filled median


def fill_dining_room(row):
    if pd.isna(row['Dining Room']):
        return median_dining_room.get(row['Type of Property'], 0)
    return row['Dining Room']


# Apply the fix
results_df['Dining Room'] = results_df.apply(fill_dining_room, axis=1)

In [70]:
results_df['Dining Room'].isnull().sum()

np.int64(0)

### Garage

In [71]:
results_df['Garage'].isnull().sum()

np.int64(13242)

In [72]:
results_df['Garage'].value_counts()

Garage
1     388
2     233
3      18
4      12
5       5
40      2
14      1
20      1
10      1
6       1
9       1
Name: count, dtype: int64

In [73]:
# Ensure that the Garage column is numeric
results_df['Garage'] = pd.to_numeric(
    results_df['Garage'], errors='coerce')


# Group by property type and calculate median
median_garage = results_df.groupby('Type of Property')[
    'Garage'].median().reset_index()

# Rename columns for clarity
median_garage.columns = ['Type of Property', 'Median Garage']

# Sort by median descending (optional)
median_garage = median_garage.sort_values(
    by='Median Garage', ascending=False)

print(median_garage)

   Type of Property  Median Garage
4        Industrial           30.0
1        Commercial           12.0
3             House            2.0
0  Apartment / Flat            1.0
5         Townhouse            1.0
2              Farm            NaN
6       Vacant Land            NaN


In [74]:
garage_threshold = 5

# Identify outliers
garage_outliers = results_df[results_df['Garage'] > garage_threshold]

# Compute medians and replace NaNs with 0
garage_median = results_df[results_df['Garage'] <= garage_threshold].groupby('Type of Property')['Garage'].median().fillna(0)

# Replace outlier values
for idx, row in garage_outliers.iterrows():
    prop_type = row['Type of Property']
    median_val = garage_median.get(prop_type, 1)
    results_df.at[idx, 'Garage'] = median_val

In [75]:
# Fill NaN garage values with property-type median, or 0 if still missing
def fill_missing_garage(row):
    if pd.isna(row['Garage']):
        return garage_median.get(row['Type of Property'], 0)
    return row['Garage']


results_df['Garage'] = results_df.apply(fill_missing_garage, axis=1)

In [76]:
results_df['Garage'].isnull().sum()

np.int64(0)

### Garden

In [77]:
results_df['Garden'].isnull().sum()

np.int64(12622)

In [78]:
results_df['Garden'].value_counts()

Garden
1     1274
2        7
23       1
3        1
Name: count, dtype: int64

In [79]:
# Ensure that the Garden column is numeric
results_df['Garden'] = pd.to_numeric(
    results_df['Garden'], errors='coerce')


# Group by property type and calculate median
median_garden = results_df.groupby('Type of Property')[
    'Garden'].median().reset_index()

# Rename columns for clarity
median_garden.columns = ['Type of Property', 'Median Garden']

# Sort by median descending (optional)
median_garden = median_garden.sort_values(
    by='Median Garden', ascending=False)

print(median_garden)

   Type of Property  Median Garden
0  Apartment / Flat            1.0
1        Commercial            1.0
2              Farm            1.0
3             House            1.0
4        Industrial            1.0
5         Townhouse            1.0
6       Vacant Land            1.0


In [80]:
garden_threshold = 3

# Identify outliers
garden_outliers = results_df[results_df['Garden'] > garden_threshold]

# Compute medians and replace NaNs with 0
garden_median = results_df[results_df['Garden'] <= garden_threshold].groupby(
    'Type of Property')['Garden'].median().fillna(0)

# Replace outlier values
for idx, row in garden_outliers.iterrows():
    prop_type = row['Type of Property']
    median_val = garden_median.get(prop_type, 1)
    results_df.at[idx, 'Garden'] = median_val

In [81]:
# Fill NaN garden values with property-type median, or 0 if still missing
def fill_missing_garden(row):
    if pd.isna(row['Garden']):
        return garden_median.get(row['Type of Property'], 0)
    return row['Garden']


results_df['Garden'] = results_df.apply(fill_missing_garden, axis=1)

In [82]:
results_df['Garden'].isnull().sum()

np.int64(0)

In [83]:
results_df['Garden'].value_counts()
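
A range check after filling can confirm that no value is missing or above the outlier threshold. A value above the threshold at this point would suggest the fill used medians from a different column. The sketch below runs on made-up data; `bad_rows` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for the cleaned Garden column (values are illustrative only)
toy_df = pd.DataFrame({'Garden': [1.0, 0.0, 2.0, 5.0]})
garden_threshold = 3

# Rows whose value is missing or still above the threshold after cleaning
out_of_range = toy_df['Garden'].isna() | (toy_df['Garden'] > garden_threshold)
bad_rows = toy_df[out_of_range]

print(bad_rows)
```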

### Pool

In [84]:
results_df['Pool'].isnull().sum()

np.int64(13175)

In [85]:
results_df['Pool'].value_counts()

Pool
1    719
2     10
3      1
Name: count, dtype: int64

In [86]:
# Ensure that the Pool column is numeric
results_df['Pool'] = pd.to_numeric(
    results_df['Pool'], errors='coerce')


# Group by property type and calculate median
median_pool = results_df.groupby('Type of Property')[
    'Pool'].median().reset_index()

# Rename columns for clarity
median_pool.columns = ['Type of Property', 'Median Pool']

# Sort by median descending (optional)
median_pool = median_pool.sort_values(
    by='Median Pool', ascending=False)

print(median_pool)

   Type of Property  Median Pool
0  Apartment / Flat          1.0
1        Commercial          1.0
3             House          1.0
5         Townhouse          1.0
2              Farm          NaN
4        Industrial          NaN
6       Vacant Land          NaN


In [87]:
results_df['Pool'] = results_df['Pool'].fillna(0)

In [88]:
results_df['Pool'].isnull().sum()

np.int64(0)

### Parking

In [89]:
results_df['Parking'].isnull().sum()

np.int64(8477)

In [90]:
results_df['Parking'].value_counts()

Parking
1      2361
2      1122
4       385
3       242
5       169
       ... 
65        1
137       1
175       1
160       1
66        1
Name: count, Length: 94, dtype: int64

In [91]:
# Ensure that the Parking column is numeric
results_df['Parking'] = pd.to_numeric(
    results_df['Parking'], errors='coerce')


# Group by property type and calculate median
median_parking = results_df.groupby('Type of Property')[
    'Parking'].median().reset_index()

# Rename columns for clarity
median_parking.columns = ['Type of Property', 'Median Parking']

# Sort by median descending (optional)
median_parking = median_parking.sort_values(
    by='Median Parking', ascending=False)

print(median_parking)

   Type of Property  Median Parking
1        Commercial             6.0
4        Industrial             5.0
2              Farm             2.0
3             House             2.0
0  Apartment / Flat             1.0
5         Townhouse             1.0
6       Vacant Land             1.0


In [92]:
parking_threshold = 5

# Identify outliers
parking_outliers = results_df[results_df['Parking'] > parking_threshold]

# Compute medians and replace NaNs with 0
parking_median = results_df[results_df['Parking'] <= parking_threshold].groupby(
    'Type of Property')['Parking'].median().fillna(0)

# Replace outlier values
for idx, row in parking_outliers.iterrows():
    prop_type = row['Type of Property']
    median_val = parking_median.get(prop_type, 1)
    results_df.at[idx, 'Parking'] = median_val

In [93]:
# Fill NaN parking values with property-type median, or 0 if still missing
def fill_missing_parking(row):
    if pd.isna(row['Parking']):
        return parking_median.get(row['Type of Property'], 0)
    return row['Parking']


results_df['Parking'] = results_df.apply(fill_missing_parking, axis=1)

In [94]:
results_df['Parking'].isnull().sum()

np.int64(0)

### Backup Water

In [95]:
results_df['Backup Water'].isnull().sum()

np.int64(12410)

In [96]:
results_df['Backup Water'].value_counts()

Backup Water
Yes    757
No     738
Name: count, dtype: int64

In [97]:
# Backup Water holds Yes/No labels, so it is kept as text:
# pd.to_numeric(..., errors='coerce') would turn every value into NaN.

# Count each answer (including missing) per property type
backup_water_counts = results_df.groupby('Type of Property')[
    'Backup Water'].value_counts(dropna=False)

print(backup_water_counts)


In [99]:
results_df['Backup Water'] = results_df['Backup Water'].fillna('Unknown')

In [101]:
results_df['Backup Water'].isnull().sum()

np.int64(0)
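
Because Backup Water is a Yes/No field rather than a count, it may be worth storing it as a categorical with a fixed set of labels once the missing answers are marked. A minimal sketch on made-up data follows; the category order is an arbitrary choice.

```python
import pandas as pd

# Toy stand-in for results_df (values are illustrative only)
toy_df = pd.DataFrame({'Backup Water': ['Yes', 'No', None, 'Yes']})

# Keep the Yes/No labels and mark missing answers explicitly
filled = toy_df['Backup Water'].fillna('Unknown')

# Store as an unordered categorical with a fixed set of labels
toy_df['Backup Water'] = pd.Categorical(
    filled, categories=['Yes', 'No', 'Unknown'])

counts = toy_df['Backup Water'].value_counts()
print(counts)
```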

### Floor Size

In [103]:
results_df['Floor Area (m²)'].isnull().sum()

np.int64(6234)

In [104]:
results_df['Floor Area (m²)'].value_counts()

Floor Area (m²)
300    105
30      93
50      90
100     89
60      78
      ... 
783      1
577      1
635      1
990      1
974      1
Name: count, Length: 764, dtype: int64

In [106]:
# Convert Floor Area to numeric (coerce errors to NaN)
results_df['Floor Area (m²)'] = pd.to_numeric(
    results_df['Floor Area (m²)'], errors='coerce')

floor_area_median = results_df.groupby('Type of Property')['Floor Area (m²)'].median().fillna(0)

def fill_floor_size(row):
    if pd.isna(row['Floor Area (m²)']):
        return floor_area_median.get(row['Type of Property'], 0)
    return row['Floor Area (m²)']


results_df['Floor Area (m²)'] = results_df.apply(fill_floor_size, axis=1)

In [107]:
results_df['Floor Area (m²)'].isnull().sum()

np.int64(0)

### Erf Size

In [109]:
results_df['Erf Size (m²)'].isnull().sum()

np.int64(10801)

In [110]:
results_df['Erf Size (m²)'].value_counts()

Erf Size (m²)
495    68
500    40
250    39
100    34
800    34
       ..
922     1
689     1
877     1
474     1
416     1
Name: count, Length: 716, dtype: int64

In [112]:
# Convert Erf Size to numeric (coerce errors to NaN)
results_df['Erf Size (m²)'] = pd.to_numeric(
    results_df['Erf Size (m²)'], errors='coerce')

# Median erf size per property type, or 0 where a type has no data
erf_size_median = results_df.groupby('Type of Property')[
    'Erf Size (m²)'].median().fillna(0)


def fill_erf_size(row):
    if pd.isna(row['Erf Size (m²)']):
        return erf_size_median.get(row['Type of Property'], 0)
    return row['Erf Size (m²)']


results_df['Erf Size (m²)'] = results_df.apply(fill_erf_size, axis=1)

In [113]:
results_df['Erf Size (m²)'].isnull().sum()

np.int64(0)
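
The per-column `isnull().sum()` checks above can be combined into one pass over all cleaned columns. The sketch below uses a made-up frame with three of the column names from this section; `cleaned_columns` is an illustrative name, and the list would need to match the columns actually cleaned.

```python
import pandas as pd

# Toy stand-in for results_df (values are illustrative only)
toy_df = pd.DataFrame({
    'Kitchen': [1.0, 1.0],
    'Garage': [2.0, 0.0],
    'Erf Size (m²)': [495.0, None],
})
cleaned_columns = ['Kitchen', 'Garage', 'Erf Size (m²)']

# Count remaining NaNs per column and keep only columns that still have some
remaining = toy_df[cleaned_columns].isnull().sum()
remaining = remaining[remaining > 0]

print(remaining)
```

An empty result means every listed column is fully populated.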