# <center> 🧽 Data Cleaning </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading Data](#2-loading-data)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

<a id="two"></a>
## 2. Loading Data
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [1]:
# Import packages
import numpy as np
import pandas as pd
import re

### Rental Properties

In [2]:
rental_listings = pd.read_csv("../data/raw/rental_listings_for_capital_cities.csv")
display("Rentals", rental_listings.head(), rental_listings.shape)

'Rentals'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...
3,,,,,,house-to-rent-in-bisho-park-101731048
4,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490


(15324, 6)

In [3]:
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15324 entries, 0 to 15323
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 15323 non-null  object
 1   location              8189 non-null   object
 2   property_title        15323 non-null  object
 3   property_description  15323 non-null  object
 4   property_features     15323 non-null  object
 5   listing_slug          15324 non-null  object
dtypes: object(6)
memory usage: 718.4+ KB


In [4]:
rental_listings.isnull().sum()

price                      1
location                7135
property_title             1
property_description       1
property_features          1
listing_slug               0
dtype: int64

In [5]:
rental_null_values = rental_listings.isnull().sum().sum()
print(f"There are {rental_null_values} null values in the rental dataset")

There are 7139 null values in the rental dataset


In [6]:
rental_duplicates = rental_listings[rental_listings.duplicated()]
num_rental_duplicates = len(rental_duplicates)
print(f"There are {num_rental_duplicates} duplicates in the rental dataset ")

There are 1303 duplicates in the rental dataset 


In [7]:
# Drop duplicate rows and keep the first occurrence
rental_listings = rental_listings.drop_duplicates().reset_index(drop=True)
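As a side note on what the duplicate count above measures: by default `duplicated()` flags only the second and later copies of a row, so the figure is the number of rows that `drop_duplicates()` removes, not the number of rows involved in duplication. A minimal sketch on a hypothetical three-row frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical, row 1 is unique.
toy = pd.DataFrame({
    "price": ["R 7 000Per Month", "R 5 300Per Month", "R 7 000Per Month"],
    "listing_slug": ["house-a-1", "house-b-2", "house-a-1"],
})

# Default keep="first": only the repeated copy (row 2) is flagged.
repeat_mask = toy.duplicated()

# keep=False flags every member of a duplicated group, useful for inspection.
all_copies_mask = toy.duplicated(keep=False)

# Dropping duplicates keeps the first occurrence and renumbers the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(repeat_mask.sum(), all_copies_mask.sum(), len(deduplicated))
```

Passing `keep=False` before dropping anything can be a convenient way to eyeball the duplicated groups side by side.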

In [8]:
rental_listings = rental_listings.dropna(subset=['price'])

In [46]:
poa_mask = rental_listings['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 217 listings with 'POA' in the price column.


In [9]:
# Remove all whitespace, e.g. "R 7 000Per Month" becomes "R7000PerMonth"
rental_listings['price'] = rental_listings['price'].str.replace(
    r'\s+', '', regex=True)

# Extract price amount
rental_listings['price (in rands)'] = rental_listings['price'].str.extract(
    r'R?(\d+)', expand=False)
rental_listings['price (in rands)'] = pd.to_numeric(rental_listings['price (in rands)'], errors='coerce')

# Extract the price type (per month or per m²) from the price string
rental_listings['price_type'] = rental_listings['price'].str.extract(
    r'(PerMonth|m2|m²|Perm²)', expand=False
)

# Standardize price type to make it more readable
rental_listings['price_type'] = rental_listings['price_type'].replace({
    'PerMonth': 'Per Month',
    'm2': 'Per m²',
    'm²': 'Per m²',
    'Perm²': 'Per m²'
})

# Print out the first five rows of the table to see the results
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month
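The price cell above can be checked on single strings without pandas. The sketch below applies the same three steps (strip whitespace, capture the digits, capture the price type) to one value at a time. The function name `parse_rental_price` and the per-m² sample string are hypothetical, and mapping a bare `m²` to `Per m²` is an assumption about the intended label.

```python
import re

# Labels follow the notebook's mapping; the bare "m²" entry is an assumption.
PRICE_TYPE_LABELS = {
    "PerMonth": "Per Month",
    "m2": "Per m²",
    "m²": "Per m²",
    "Perm²": "Per m²",
}


def parse_rental_price(raw):
    """Return (amount_in_rands, price_type) for one raw price string."""
    if raw is None:
        return None, None
    # "R 7 000Per Month" -> "R7000PerMonth"
    compact = re.sub(r"\s+", "", raw)
    amount_match = re.search(r"R?(\d+)", compact)
    type_match = re.search(r"(PerMonth|m2|m²|Perm²)", compact)
    amount = int(amount_match.group(1)) if amount_match else None
    price_type = PRICE_TYPE_LABELS[type_match.group(1)] if type_match else None
    return amount, price_type


print(parse_rental_price("R 7 000Per Month"))  # (7000, 'Per Month')
print(parse_rental_price("POA"))               # (None, None)
```

A "POA" value has no digits and no price type, which is consistent with the missing values that show up in `price (in rands)` and `price_type` later in the notebook.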


In [10]:
# Extract the suburb from the property_title column
rental_listings['suburb'] = rental_listings['property_title'].str.extract(
    r'to Rent in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
rental_listings['suburb'] = (
    rental_listings['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type,suburb
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month,Tyu Tyu North
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month,Bhisho Central
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month,Tyu Tyu North
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month,Bisho Park
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month,Bisho Park
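To see what the title pattern captures, here is a small standalone sketch using the standard `re` module. The helper name `extract_suburb` is hypothetical, and accepting both "to Rent in" and "for Sale in" in a single pattern is a convenience assumption so that one helper could serve both datasets.

```python
import re


def extract_suburb(title):
    """Return the text after 'to Rent in' or 'for Sale in', or None if absent."""
    match = re.search(r"(?:to Rent|for Sale) in (.+)", title)
    return match.group(1).strip() if match else None


print(extract_suburb("3 Bedroom House to Rent in Tyu Tyu North"))    # Tyu Tyu North
print(extract_suburb("3 Bedroom House for Sale in Bhisho Central"))  # Bhisho Central
print(extract_suburb("Vacant Land"))                                 # None
```

Titles that do not contain either phrase yield `None`, so counting the nulls in `suburb` after extraction is a quick way to spot titles with an unexpected format.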


In [11]:
city_province = pd.read_csv("../data/raw/south_africa_provinces_capitals.csv")
city_list = city_province['city'].dropna().str.strip().unique()

In [12]:
def extract_city_from_location(location, city_list):
    """Return the first city in city_list found in location (case-insensitive), else None."""
    if pd.isna(location):
        return None
    
    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None
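The function above uses plain substring matching and returns the first hit in list order. A slightly stricter variant, sketched below under assumptions, matches whole words only and tries longer city names first. This is not the notebook's method; `match_city` and the short city list are hypothetical, and whether the extra strictness matters depends on the names in the capitals file.

```python
import re


def match_city(location, cities):
    """Return the first city found as a whole word in location, else None."""
    # Missing locations arrive as NaN (a float), so anything non-string is skipped.
    if not isinstance(location, str):
        return None
    # Longest names first, so a longer name wins over a shorter one it contains.
    for city in sorted(cities, key=len, reverse=True):
        if re.search(rf"\b{re.escape(city)}\b", location, flags=re.IGNORECASE):
            return city
    return None


# Hypothetical subset of the capitals list, for illustration only.
cities = ["Cape Town", "Bhisho", "Johannesburg"]

print(match_city("Bhisho, Tyu Tyu North", cities))  # Bhisho
print(match_city(float("nan"), cities))             # None
```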

In [13]:
# Apply the above function to extract the city from the location
rental_listings['city'] = rental_listings['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

In [14]:
rental_listings['city'].value_counts()

city
Johannesburg        2544
Cape Town           2379
Bloemfontein         855
Polokwane            562
Nelspruit            531
Pietermaritzburg     458
Kimberley            136
Mafikeng              26
Bhisho                 5
Name: count, dtype: int64

In [15]:
rental_listings['city'].isnull().sum()

np.int64(6524)

In [17]:
# Create a mapping of suburb to known cities
suburb_to_city_map = rental_listings[rental_listings['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
rental_listings['city'] = rental_listings.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)

In [18]:
rental_listings['city'].isnull().sum()

np.int64(91)
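The suburb-based fill can also be written without a row-wise `apply`, and it is worth checking first whether any suburb name appears under more than one city, since a dictionary can hold only one city per suburb. The sketch below is one possible formulation on a hypothetical toy frame, not the notebook's code; the `Unknown Suburb` row is made up.

```python
import pandas as pd

# Hypothetical toy data: two rows with a known city, three without.
df = pd.DataFrame({
    "suburb": ["Bisho Park", "Bisho Park", "Sandton", "Sandton", "Unknown Suburb"],
    "city": ["Bhisho", None, "Johannesburg", None, None],
})

# Distinct (suburb, city) pairs among rows where the city is known.
known = df.dropna(subset=["city"]).drop_duplicates(subset=["suburb", "city"])

# Suburbs seen under more than one city would make the lookup ambiguous.
cities_per_suburb = known.groupby("suburb")["city"].nunique()
ambiguous = cities_per_suburb[cities_per_suburb > 1].index.tolist()

# One city per suburb, then fill only the missing values.
suburb_to_city = known.drop_duplicates(subset="suburb").set_index("suburb")["city"]
df["city"] = df["city"].fillna(df["suburb"].map(suburb_to_city))

print(ambiguous)
print(df)
```

Suburbs that never occur with a known city stay null after the fill, which corresponds to the rows that are still missing in the count above.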

In [19]:
# Drop the rows where the city is still missing
rental_listings = rental_listings.dropna(subset=['city'])

In [None]:
# Merge the city with its matching province
rental_listings = rental_listings.merge(city_province, on='city', how='left')
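A left merge like the one above silently duplicates listings if a city appears more than once in the lookup table, and leaves `province` null for cities that are not in it. `DataFrame.merge` has `validate` and `indicator` parameters that make both cases visible. The sketch below uses hypothetical toy frames; the column names `city` and `province` follow the notebook, and "Atlantis" is a made-up value with no match.

```python
import pandas as pd

# Hypothetical listings; "Atlantis" has no entry in the lookup table.
listings = pd.DataFrame({
    "suburb": ["Bisho Park", "Sandton", "Nowhere"],
    "city": ["Bhisho", "Johannesburg", "Atlantis"],
})
capitals = pd.DataFrame({
    "city": ["Bhisho", "Johannesburg"],
    "province": ["Eastern Cape", "Gauteng"],
})

# validate raises MergeError if a city is repeated on the right-hand side;
# indicator adds a "_merge" column recording where each row found a match.
merged = listings.merge(
    capitals, on="city", how="left", validate="many_to_one", indicator=True
)

unmatched = merged.loc[merged["_merge"] == "left_only", "city"].tolist()
print(unmatched)  # ['Atlantis']
```

Dropping the `_merge` column afterwards keeps the resulting frame identical in shape to the plain merge.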

In [21]:
rental_listings.isnull().sum()

price                      0
location                6420
property_title             0
property_description       0
property_features          0
listing_slug               0
price (in rands)         217
price_type               155
suburb                     0
city                       0
province                   0
dtype: int64

In [23]:
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13929 entries, 0 to 13928
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   price                 13929 non-null  object 
 1   location              7509 non-null   object 
 2   property_title        13929 non-null  object 
 3   property_description  13929 non-null  object 
 4   property_features     13929 non-null  object 
 5   listing_slug          13929 non-null  object 
 6   price (in rands)      13712 non-null  float64
 7   price_type            13774 non-null  object 
 8   suburb                13929 non-null  object 
 9   city                  13929 non-null  object 
 10  province              13929 non-null  object 
dtypes: float64(1), object(10)
memory usage: 1.2+ MB


### Listings for sale

In [24]:
listings_for_sale = pd.read_csv("../data/raw/listings_for_sale_in_capital_cities.csv")
display("Listings for sale", listings_for_sale.head(), listings_for_sale.shape)

'Listings for sale'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796


(40006, 6)

In [None]:
# Summarise the columns, non-null counts and dtypes of listings_for_sale
listings_for_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40006 entries, 0 to 40005
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 40002 non-null  object
 1   location              33028 non-null  object
 2   property_title        40002 non-null  object
 3   property_description  40001 non-null  object
 4   property_features     40002 non-null  object
 5   listing_slug          40006 non-null  object
dtypes: object(6)
memory usage: 1.8+ MB


In [None]:
# See the number of null values in each column
listings_for_sale.isnull().sum()

price                      4
location                6978
property_title             4
property_description       5
property_features          4
listing_slug               0
dtype: int64

In [27]:
# Drop listings where the property_title is null
listings_for_sale = listings_for_sale.dropna(subset=['property_title']).reset_index(drop=True)

In [28]:
listings_for_sale_duplicates = listings_for_sale[listings_for_sale.duplicated()]
num_listings_for_sale_duplicates = len(listings_for_sale_duplicates)
print(f"There are {num_listings_for_sale_duplicates} duplicates in the listings-for-sale dataset ")

There are 2072 duplicates in the listings-for-sale dataset 


In [29]:
# Drop duplicate rows and keep the first occurrence
listings_for_sale = listings_for_sale.drop_duplicates().reset_index(drop=True)

In [None]:
print(listings_for_sale['price'].str.contains(
    'POA', case=False, na=False).sum())

In [None]:
# Disabled: remove rows where 'price' contains 'POA' (case-insensitive)
# listings_for_sale = listings_for_sale[~listings_for_sale['price'].astype(
#     str).str.contains('POA', case=False, na=False)]
# listings_for_sale.reset_index(drop=True, inplace=True)

In [30]:
listings_for_sale['price_cleaned'] = listings_for_sale['price'].str.extract(
    r'(R\s?\d[\d\s]*)')

listings_for_sale['price_cleaned'] = listings_for_sale['price_cleaned'].str.replace(
    r'\s+', '', regex=True)

listings_for_sale['price (in rands)'] = listings_for_sale['price_cleaned'].str.extract(
    r'R?(\d+)', expand=False)

# Print out the first five rows to see the results
listings_for_sale.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price_cleaned,price (in rands)
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R690000,690000
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...,R700000,700000
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...,R749000,749000
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R799000,799000
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796,R695000,695000
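The sale prices arrive fused with page text such as "Bond Costs", so the cell above first isolates the leading `R` amount and then strips the spaces. Note that it leaves `price (in rands)` as text, whereas the rental section converts the column with `pd.to_numeric`. A standalone sketch of the same two steps that returns an integer, with the hypothetical helper name `parse_sale_price`:

```python
import re


def parse_sale_price(raw):
    """Return the leading rand amount as an int, or None if there is none."""
    # "R", an optional space, then digits and spaces up to the first other character.
    match = re.search(r"R\s?\d[\d\s]*", raw)
    if match is None:
        return None
    # "R 690 000" -> "R690000" -> 690000
    return int(re.sub(r"\s+", "", match.group(0)).lstrip("R"))


print(parse_sale_price("R 690 000Bond CostsBond CalculatorPurchase Pri..."))  # 690000
print(parse_sale_price("POA"))                                                # None
```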


In [31]:
# Extract the suburb from the property_title column
listings_for_sale['suburb'] = listings_for_sale['property_title'].str.extract(
    r'for Sale in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
listings_for_sale['suburb'] = (
    listings_for_sale['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
listings_for_sale.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price_cleaned,price (in rands),suburb
0,R 690 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,RE/MAX Border is proud to present this family ...,Property OverviewListing Number115871179Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R690000,690000,Bhisho Central
1,R 700 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,This spacious property offers a well-designed ...,Property OverviewListing Number115765257Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1157...,R700000,700000,Tyu Tyu North
2,R 749 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu North",3 Bedroom House for Sale in Tyu Tyu North,Sizwe Mlungwana Properties is excited to prese...,Property OverviewListing Number115864665Type o...,3-bedroom-house-for-sale-in-tyu-tyu-north-1158...,R749000,749000,Tyu Tyu North
3,R 799 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Bhisho Central",3 Bedroom House for Sale in Bhisho Central,Discover this well located 3 bedroom home.Perf...,Property OverviewListing Number115811196Type o...,3-bedroom-house-for-sale-in-bhisho-central-115...,R799000,799000,Bhisho Central
4,R 695 000Bond CostsBond CalculatorPurchase Pri...,"Bhisho, Tyu Tyu",3 Bedroom House for Sale in Tyu Tyu,Presenting this beautiful house to the market....,Property OverviewListing Number115837796Type o...,3-bedroom-house-for-sale-in-tyu-tyu-115837796,R695000,695000,Tyu Tyu


In [None]:
# Apply the extract_city_from_location function to extract the city
listings_for_sale['city'] = listings_for_sale['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

In [33]:
listings_for_sale['city'].isnull().sum()

np.int64(6604)

In [34]:
# Create a mapping of suburb to known cities
suburb_to_city_map = listings_for_sale[listings_for_sale['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
listings_for_sale['city'] = listings_for_sale.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)

In [None]:
# Count the listings per city
listings_for_sale['city'].value_counts()

city
Johannesburg        12561
Bloemfontein         6808
Cape Town            5615
Polokwane            4257
Pietermaritzburg     3426
Nelspruit            2504
Kimberley            2447
Mafikeng              165
Bhisho                 88
Name: count, dtype: int64

In [36]:
listings_for_sale.isnull().sum()

price                      0
location                6601
property_title             0
property_description       1
property_features          0
listing_slug               0
price_cleaned            542
price (in rands)         542
suburb                     0
city                      59
dtype: int64

In [None]:
# Inspect the rows where the city is still missing
listings_for_sale[listings_for_sale['city'].isnull()]

In [38]:
# Drop the rows where the city column is empty
listings_for_sale = listings_for_sale.dropna(subset=['city'])

In [39]:
listings_for_sale.isnull().sum()

price                      0
location                6542
property_title             0
property_description       1
property_features          0
listing_slug               0
price_cleaned            540
price (in rands)         540
suburb                     0
city                       0
dtype: int64

In [None]:
# Merge to get the matching province for the listings
listings_for_sale = listings_for_sale.merge(city_province, on='city', how='left')

In [41]:
listings_for_sale.isnull().sum()

price                      0
location                6542
property_title             0
property_description       1
property_features          0
listing_slug               0
price_cleaned            540
price (in rands)         540
suburb                     0
city                       0
province                   0
dtype: int64

In [47]:
poa_mask = listings_for_sale['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 540 listings with 'POA' in the price column.
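The POA count equals the 540 missing values reported for `price (in rands)` above, which suggests that the POA ("price on application") listings are the ones without a numeric price. Rather than dropping them, one option is to record the fact in a boolean column so they can be filtered when needed. This is a sketch on a hypothetical toy frame; the column name `is_poa` and the sample strings are assumptions.

```python
import pandas as pd

# Hypothetical price strings, including a missing value.
toy = pd.DataFrame({
    "price": ["R 690 000Bond Costs", "POA", "poa - contact agent", None],
})

# Case-insensitive flag; na=False treats missing prices as "not POA".
toy["is_poa"] = toy["price"].str.contains("POA", case=False, na=False)

# Listings with a usable price: not POA and not missing.
priced_only = toy[~toy["is_poa"] & toy["price"].notna()]

print(toy["is_poa"].tolist())  # [False, True, True, False]
print(len(priced_only))        # 1
```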
