# <center> 🧽 Data Cleaning </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading-Data](#2-loading-data)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

<a id="two"></a>
## 2. Loading Data
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [1]:
# Import packages
import numpy as np
import pandas as pd
import re

### Rental Properties

In [2]:
rental_listings = pd.read_csv("../data/raw/rental_listings_for_capital_cities.csv")
display("Rentals", rental_listings.head(), rental_listings.shape)

'Rentals'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...
3,,,,,,house-to-rent-in-bisho-park-101731048
4,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490


(15324, 6)

In [3]:
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15324 entries, 0 to 15323
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 15323 non-null  object
 1   location              8189 non-null   object
 2   property_title        15323 non-null  object
 3   property_description  15323 non-null  object
 4   property_features     15323 non-null  object
 5   listing_slug          15324 non-null  object
dtypes: object(6)
memory usage: 718.4+ KB


In [None]:
rental_listings.isnull().sum()

In [4]:
rental_null_values = rental_listings.isnull().sum().sum()
print(f"There are {rental_null_values} null values in the rental dataset")

There are 7139 null values in the rental dataset


In [5]:
rental_duplicates = rental_listings[rental_listings.duplicated()]
num_rental_duplicates = len(rental_duplicates)
print(f"There are {num_rental_duplicates} duplicates in the rental dataset ")

There are 1303 duplicates in the rental dataset 


In [6]:
# Drop duplicate rows and keep the first occurrence
rerental_listings =rental_listings.drop_duplicates().reset_index(drop=True)

In [7]:
rental_listings = rental_listings.dropna(subset=['price'])

In [8]:
# Normalize spaces
rental_listings['price'] = rental_listings['price'].str.replace(
    r'\s+', '', regex=True)

# Extract price amount
rental_listings['price (in rands)'] = rental_listings['price'].str.extract(
    r'R?(\d+)', expand=False)
rental_listings['price (in rands)'] = pd.to_numeric(rental_listings['price (in rands)'], errors='coerce')

# Now safely extract price type from the actual price string
rental_listings['price_type'] = rental_listings['price'].str.extract(
    r'(PerMonth|m2|m²|Perm²)', expand=False
)

# Standardize price type to make it more readable
rental_listings['price_type'] = rental_listings['price_type'].replace({
    'PerMonth': 'Per Month',
    'm2': 'Per m²',
    'Perm²': 'Per m²'
})

# Print out the first five rows of the table to see the results
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month


In [9]:
rental_listings['listing_slug'].isnull().sum()

np.int64(0)

In [10]:
# Extract the location from the listing_slug column
rental_listings['suburb'] = rental_listings['property_title'].str.extract(
    r'to Rent in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
rental_listings['suburb'] = (
    rental_listings['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type,suburb
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month,Tyu Tyu North
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month,Bhisho Central
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month,Tyu Tyu North
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month,Bisho Park
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month,Bisho Park


In [11]:
def extract_city(location):
    if pd.isna(location):
        return None
    parts = [part.strip() for part in location.split(',')]
    if len(parts) == 2:
        return parts[0]
    elif len(parts) >= 3:
        return parts[-1]
    else:
        return None

# Apply the function to the rental_listings dataset
rental_listings['city'] = rental_listings['location'].apply(extract_city)

In [12]:
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type,suburb,city
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month,Tyu Tyu North,Bhisho
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month,Bhisho Central,Bhisho
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month,Tyu Tyu North,Bhisho
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month,Bisho Park,Bhisho
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month,Bisho Park,Bhisho


In [13]:
# Create a mapping of suburb to kmown cities
suburb_to_city_map = rental_listings[rental_listings['city'].notna()].drop_duplicates(subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
rental_listings['city'] = rental_listings.apply(lambda row: suburb_to_city_map.get(row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis= 1)   

In [14]:
rental_listings['city'].isnull().sum()

np.int64(85)

In [15]:
rental_listings[rental_listings['city'].isna()]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type,suburb,city
4029,R297600PerMonth,,Industrial Property to Rent in Pinati,18 Induland Crescent is home to a wonderful In...,Property OverviewListing Number115017502Type o...,industrial-property-to-rent-in-pinati-115017502,297600.0,Per Month,Pinati,
4049,R90000PerMonth,,Industrial Property to Rent in Philippi East,This prime warehouse space in Philippi offers ...,Property OverviewListing Number115112950Type o...,industrial-property-to-rent-in-philippi-east-1...,90000.0,Per Month,Philippi East,
4327,R150720PerMonth,,Commercial Property to Rent in University Estate,A single-standing commercial and industrial bu...,Property OverviewListing Number115164882Type o...,commercial-property-to-rent-in-university-esta...,150720.0,Per Month,University Estate,
4451,R150720PerMonth,,Commercial Property to Rent in University Estate,A single-standing commercial and industrial bu...,Property OverviewListing Number115164882Type o...,commercial-property-to-rent-in-university-esta...,150720.0,Per Month,University Estate,
5096,R10000PerMonth,,4 Bedroom House to Rent in Penlyn Estate,This recently renovated house in Penlyn Estate...,Property OverviewListing Number114336051Type o...,4-bedroom-house-to-rent-in-penlyn-estate-11433...,10000.0,Per Month,Penlyn Estate,
...,...,...,...,...,...,...,...,...,...,...
14562,R10800PerMonth,,3 Bedroom Townhouse to Rent in Muswell Hill,Modern Duplex in secure estate. Large garden f...,Property OverviewListing Number115426368Type o...,3-bedroom-townhouse-to-rent-in-muswell-hill-11...,10800.0,Per Month,Muswell Hill,
14673,R6000PerMonth,,3 Bedroom House to Rent in Edendale N,Very quiet Area of Imbali Unit 13 in Pietermar...,Property OverviewListing Number113765087Type o...,3-bedroom-house-to-rent-in-edendale-n-113765087,6000.0,Per Month,Edendale N,
14918,R9300PerMonth,,2 Bedroom Townhouse to Rent in Eagles Crest,This charming two-bedroom townhouse in Bendor ...,Property OverviewListing Number115652637Type o...,2-bedroom-townhouse-to-rent-in-eagles-crest-11...,9300.0,Per Month,Eagles Crest,
15060,R2950PerMonth,,1 Bedroom Apartment / Flat to Rent in Leeukuil AH,Discover the serene beauty of Polokwane from t...,Property OverviewListing Number108326780Type o...,1-bedroom-apartment-flat-to-rent-in-leeukuil-a...,2950.0,Per Month,Leeukuil AH,


In [None]:
rental_listings['suburb'].isnull().sum()

### Listings for sale

In [16]:
listings_for_sale = pd.read_csv("../data/raw/listings_for_sale_in_capital_cities.csv")
display("Listings for sale", listings_for_sale.head(), listings_for_sale.shape)

'Listings for sale'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796


(40006, 6)

In [17]:
listings_for_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40006 entries, 0 to 40005
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 40002 non-null  object
 1   location              33028 non-null  object
 2   property_title        40002 non-null  object
 3   property_description  40001 non-null  object
 4   property_features     40002 non-null  object
 5   listing_slug          40006 non-null  object
dtypes: object(6)
memory usage: 1.8+ MB


In [None]:
listings_for_sale.isnull().sum()

In [18]:
# Drop listings where the property_title is null
listings_for_sale = listings_for_sale.dropna(subset=['property_title']).reset_index(drop=True)

In [19]:
listings_for_sale_duplicates = listings_for_sale[listings_for_sale.duplicated()]
num_listings_for_sale_duplicates = len(listings_for_sale_duplicates)
print(f"There are {num_listings_for_sale_duplicates} duplicates in the rental dataset ")

There are 2072 duplicates in the rental dataset 


In [20]:
# Drop duplicate rows and keep the first occurrence
listings_for_sale = listings_for_sale.drop_duplicates().reset_index(drop=True)

In [21]:
listings_for_sale['price_cleaned'] = listings_for_sale['price'].str.extract(
    r'(R\s?\d[\d\s]*)')

listings_for_sale['price_cleaned'] = listings_for_sale['price_cleaned'].str.replace(
    r'\s+', '', regex=True)

listings_for_sale['price (in rands)'] = listings_for_sale['price_cleaned'].str.extract(
    r'R?(\d+)', expand=False)

# Print out the first five rows to see the results
listings_for_sale.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price_cleaned,price (in rands)
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R690000,690000
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...,R700000,700000
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...,R749000,749000
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R799000,799000
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796,R695000,695000


In [22]:
# Extract the location from the listing_slug column
listings_for_sale['suburb'] = listings_for_sale['property_title'].str.extract(
    r'for Sale in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
listings_for_sale['suburb'] = (
    listings_for_sale['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
listings_for_sale.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price_cleaned,price (in rands),suburb
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R690000,690000,Bhisho Central
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...,R700000,700000,Tyu Tyu North
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...,R749000,749000,Tyu Tyu North
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R799000,799000,Bhisho Central
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796,R695000,695000,Tyu Tyu


In [None]:
# Apply the extract_city function again for the listings for sale
listings_for_sale['city'] = listings_for_sale['location'].apply(extract_city)

In [24]:
listings_for_sale['city'].isnull().sum()

np.int64(6602)

In [25]:
# Create a mapping of suburb to kmown cities
suburb_to_city_map = listings_for_sale[listings_for_sale['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
listings_for_sale['city'] = listings_for_sale.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)