# <center> 🧽 Data Cleaning </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading Data](#2-loading-data)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

<a id="two"></a>
## 2. Loading Data
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [10]:
# Import packages
import numpy as np
import pandas as pd
import re

### Rental Properties

In [11]:
rental_listings = pd.read_csv("../data/raw/rental_listings_for_capital_cities.csv")
display("Rentals", rental_listings.head(), rental_listings.shape)

'Rentals'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...
3,,,,,,house-to-rent-in-bisho-park-101731048
4,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490


(15324, 6)

In [12]:
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15324 entries, 0 to 15323
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 15323 non-null  object
 1   location              8189 non-null   object
 2   property_title        15323 non-null  object
 3   property_description  15323 non-null  object
 4   property_features     15323 non-null  object
 5   listing_slug          15324 non-null  object
dtypes: object(6)
memory usage: 718.4+ KB


In [13]:
rental_listings.isnull().sum()

price                      1
location                7135
property_title             1
property_description       1
property_features          1
listing_slug               0
dtype: int64

In [14]:
rental_null_values = rental_listings.isnull().sum().sum()
print(f"There are {rental_null_values} null values in the rental dataset")

There are 7139 null values in the rental dataset


In [15]:
rental_duplicates = rental_listings[rental_listings.duplicated()]
num_rental_duplicates = len(rental_duplicates)
print(f"There are {num_rental_duplicates} duplicates in the rental dataset ")

There are 1303 duplicates in the rental dataset 


In [16]:
# Drop duplicate rows and keep the first occurrence
rental_listings = rental_listings.drop_duplicates().reset_index(drop=True)
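Before dropping, it can help to look at every copy of a duplicated row rather than only the repeats. A minimal sketch on a small made-up frame (`sample` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample; the values are illustrative only
sample = pd.DataFrame({
    "price": ["R 7 000Per Month", "R 7 000Per Month", "R 5 300Per Month"],
    "listing_slug": ["a-1", "a-1", "b-2"],
})

# keep=False flags every member of a duplicate group, not just the later repeats
all_copies = sample[sample.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default
deduped = sample.drop_duplicates().reset_index(drop=True)

print(len(all_copies), len(deduped))
```

Note that `duplicated()` with its default `keep='first'` counts only the later repeats, which is what the duplicate count above reports.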

In [17]:
rental_listings[rental_listings['price'].isna()]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
3,,,,,,house-to-rent-in-bisho-park-101731048


In [18]:
# Drop rows where the price is missing
rental_listings = rental_listings.dropna(subset=['price'])

In [19]:
poa_mask = rental_listings['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 218 listings with 'POA' in the price column.


In [20]:
# Remove all whitespace, so 'R 7 000Per Month' becomes 'R7000PerMonth'
rental_listings['price'] = rental_listings['price'].str.replace(
    r'\s+', '', regex=True)

# Extract price amount
rental_listings['price (in rands)'] = rental_listings['price'].str.extract(
    r'R?(\d+)', expand=False)
rental_listings['price (in rands)'] = pd.to_numeric(rental_listings['price (in rands)'], errors='coerce')

# Now safely extract price type from the actual price string
rental_listings['price_type'] = rental_listings['price'].str.extract(
    r'(PerMonth|m2|m²|Perm²)', expand=False
)

# Standardize price type to make it more readable
rental_listings['price_type'] = rental_listings['price_type'].replace({
    'PerMonth': 'Per Month',
    'm2': 'Per m²',
    'm²': 'Per m²',
    'Perm²': 'Per m²'
})

# Print out the first five rows of the table to see the results
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month
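The extraction above can be checked in isolation on a few strings. The sketch below uses only the standard `re` module; `parse_price`, `PRICE_RE`, `TYPE_RE` and `TYPE_LABELS` are hypothetical names, and the per-m² and POA inputs are assumed formats rather than rows copied from the data.

```python
import re

PRICE_RE = re.compile(r"R(\d+)")
TYPE_RE = re.compile(r"PerMonth|Perm²|m²|m2")
TYPE_LABELS = {
    "PerMonth": "Per Month",
    "Perm²": "Per m²",
    "m²": "Per m²",
    "m2": "Per m²",
}


def parse_price(raw):
    """Return (amount, price_type); either part is None when it is absent."""
    compact = re.sub(r"\s+", "", raw)  # 'R 7 000Per Month' -> 'R7000PerMonth'
    amount = PRICE_RE.search(compact)
    kind = TYPE_RE.search(compact)
    return (
        int(amount.group(1)) if amount else None,
        TYPE_LABELS[kind.group(0)] if kind else None,
    )


print(parse_price("R 7 000Per Month"))
print(parse_price("R 72Per m²"))  # assumed raw format for a per-m² price
print(parse_price("POA"))         # price on application: no amount, no type
```

Listings priced as "POA" contain no digits, which is why they end up as missing values in `price (in rands)` after `pd.to_numeric(..., errors='coerce')`.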


In [21]:
# Extract the suburb from the property_title column
rental_listings['suburb'] = rental_listings['property_title'].str.extract(
    r'to Rent in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
rental_listings['suburb'] = (
    rental_listings['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
rental_listings.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type,suburb
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month,Tyu Tyu North
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month,Bhisho Central
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month,Tyu Tyu North
4,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month,Bisho Park
5,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month,Bisho Park


In [22]:
rental_listings['suburb'].isnull().sum()

np.int64(0)

In [23]:
city_province = pd.read_csv("../data/raw/south_africa_provinces_capitals.csv")
city_list = city_province['city'].dropna().str.strip().unique()

In [24]:
def extract_city_from_location(location, city_list):
    if pd.isna(location):
        return None
    
    
    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None
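A quick check of how this helper behaves, repeated here so the snippet runs on its own. The `cities` list is a hypothetical stand-in for `city_list`, and the second location string is made up.

```python
import pandas as pd


def extract_city_from_location(location, city_list):
    if pd.isna(location):
        return None
    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None


# Hypothetical stand-in for city_list
cities = ["Cape Town", "Bhisho"]

first = extract_city_from_location("Bhisho, Tyu Tyu North", cities)
second = extract_city_from_location("CAPE TOWN, Woodstock", cities)
missing = extract_city_from_location(float("nan"), cities)
print(first, second, missing)
```

The match is a case-insensitive substring test, so the first city in the list that appears anywhere in the location string wins.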

In [25]:
# Apply the function above to extract the city from each location
rental_listings['city'] = rental_listings['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

In [26]:
rental_listings['city'].value_counts()

city
Johannesburg        2544
Cape Town           2379
Bloemfontein         855
Polokwane            562
Nelspruit            531
Pietermaritzburg     458
Kimberley            136
Mafikeng              26
Bhisho                 5
Name: count, dtype: int64

In [27]:
rental_listings['city'].isnull().sum()

np.int64(6524)

In [28]:
# Create a mapping of suburb to known cities
suburb_to_city_map = rental_listings[rental_listings['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
rental_listings['city'] = rental_listings.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)

In [29]:
rental_listings['city'].isnull().sum()

np.int64(91)
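The same suburb-based fill can be written without a row-wise `apply`. This is a sketch of one possible alternative on made-up data, not the approach used in this notebook; `sample` and `suburb_to_city` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample: two rows are missing their city
sample = pd.DataFrame({
    "suburb": ["Woodstock", "Woodstock", "Rosebank", "Hamilton"],
    "city": ["Cape Town", None, None, "Bloemfontein"],
})

# Build a suburb -> city lookup from rows where the city is known.
# Keeping one row per suburb gives the lookup a unique index.
known = sample.dropna(subset=["city"]).drop_duplicates(subset="suburb")
suburb_to_city = known.set_index("suburb")["city"]

# map() yields NaN for suburbs with no known city, so those rows stay missing
sample["city"] = sample["city"].fillna(sample["suburb"].map(suburb_to_city))
print(sample)
```

One design caveat applies to both versions: a suburb name that occurs in more than one city is ambiguous. This sketch keeps the first city seen for a suburb, whereas converting to a dict keeps the last.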

In [30]:
# Drop rows where the city is still missing
rental_listings = rental_listings.dropna(subset=['city'])

In [31]:
# Merge the city with its matching province
rental_listings = rental_listings.merge(city_province, on='city', how='left')
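A left merge silently leaves `province` empty for any city missing from the lookup, and duplicates rows if the lookup lists a city twice. The sketch below shows how `validate` and `indicator` can surface both problems; the two frames are made up, and the lookup deliberately omits one city.

```python
import pandas as pd

listings = pd.DataFrame({
    "suburb": ["Woodstock", "Hamilton", "Bendor"],
    "city": ["Cape Town", "Bloemfontein", "Polokwane"],
})

# Hypothetical lookup that deliberately leaves out Polokwane
lookup = pd.DataFrame({
    "city": ["Cape Town", "Bloemfontein"],
    "province": ["Western Cape", "Free State"],
})

merged = listings.merge(
    lookup,
    on="city",
    how="left",
    validate="many_to_one",  # raises MergeError if a city repeats in lookup
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no matching province
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["suburb", "city"]])
```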

In [32]:
rental_listings.isnull().sum()

price                      0
location                6420
property_title             0
property_description       0
property_features          0
listing_slug               0
price (in rands)         217
price_type               155
suburb                     0
city                       0
province                   0
dtype: int64

In [33]:
# Suburbs 
rental_listings['suburb'].value_counts()

suburb
Cape Town City Centre    1077
Rosebank                  639
Woodstock                 350
Johannesburg Central      296
Claremont                 275
                         ... 
Lebowakgomo                 1
Lebowakgomo Zone F          1
Silwerkruin                 1
Bhisho Central              1
Nirvana Ext 3               1
Name: count, Length: 598, dtype: int64

In [34]:
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13929 entries, 0 to 13928
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   price                 13929 non-null  object 
 1   location              7509 non-null   object 
 2   property_title        13929 non-null  object 
 3   property_description  13929 non-null  object 
 4   property_features     13929 non-null  object 
 5   listing_slug          13929 non-null  object 
 6   price (in rands)      13712 non-null  float64
 7   price_type            13774 non-null  object 
 8   suburb                13929 non-null  object 
 9   city                  13929 non-null  object 
 10  province              13929 non-null  object 
dtypes: float64(1), object(10)
memory usage: 1.2+ MB


In [35]:
rental_listings["property_features"] = rental_listings["property_features"].str.replace(
    r'Points of Interest.*', '', regex=True
)

rental_listings["property_features"] = rental_listings["property_features"].str.strip()

In [36]:
rental_listings.head(10)

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,price (in rands),price_type,suburb,city,province
0,R7000PerMonth,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,7000.0,Per Month,Tyu Tyu North,Bhisho,Eastern Cape
1,R7000PerMonth,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,7000.0,Per Month,Bhisho Central,Bhisho,Eastern Cape
2,R5300PerMonth,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,5300.0,Per Month,Tyu Tyu North,Bhisho,Eastern Cape
3,R7500PerMonth,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,7500.0,Per Month,Bisho Park,Bhisho,Eastern Cape
4,R5500PerMonth,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,5500.0,Per Month,Bisho Park,Bhisho,Eastern Cape
5,R16000PerMonth,"Bloemfontein, Groenvlei SH",4 Bedroom Townhouse to Rent in Groenvlei SH,Modern living with comfort and convenience!Thi...,Property OverviewListing Number115649514Type o...,4-bedroom-townhouse-to-rent-in-groenvlei-sh-11...,16000.0,Per Month,Groenvlei SH,Bloemfontein,Free State
6,R28000PerMonth,"Bloemfontein, Hamilton",Industrial Property to Rent in Hamilton,This well-located and highly functional wareho...,Property OverviewListing Number115626383Type o...,industrial-property-to-rent-in-hamilton-115626383,28000.0,Per Month,Hamilton,Bloemfontein,Free State
7,R86000PerMonth,"Bloemfontein, Hamilton",Industrial Property to Rent in Hamilton,"This neat and secure 2000m2 warehouse, situate...",Property OverviewListing Number115626381Type o...,industrial-property-to-rent-in-hamilton-115626381,86000.0,Per Month,Hamilton,Bloemfontein,Free State
8,R13990PerMonth,"Bloemfontein, Shellyvale",3 Bedroom Townhouse to Rent in Shellyvale,This delightful standalone home in the highly ...,Property OverviewListing Number115103630Type o...,3-bedroom-townhouse-to-rent-in-shellyvale-1151...,13990.0,Per Month,Shellyvale,Bloemfontein,Free State
9,R18000PerMonth,"Bloemfontein, Woodland Hills Bergendal",3 Bedroom Townhouse to Rent in Woodland Hills ...,"Experience luxury living in this brand-new, do...",Property OverviewListing Number113730436Type o...,3-bedroom-townhouse-to-rent-in-woodland-hills-...,18000.0,Per Month,Woodland Hills Bergendal,Bloemfontein,Free State


In [37]:
patterns = {
    'Listing Number': r'Listing Number(\d+)',
    'Type of Property': r'Type of Property\s*(House|Apartment / Flat|Townhouse|Commercial|Industrial|Vacant Land|Farm)',
    #'Description': r'Description([A-Za-z]+)',
    #'Lifestyle': r'Lifestyle([A-Za-z]+)',
    #'Occupation Date': r'Occupation Date([\d\w\s]+?)Deposit Requirements',
    #'Deposit Amount': r'Deposit Amount:\s*R\s?([\d,]+)',
    'Pets Allowed': r'Pets Allowed(Yes|No)',
    'Furnished': r'Furnished(Yes|No)',
    'Bedrooms': r'Bedrooms?\s*(\d+)',
    'Bathrooms': r'Bathrooms?\s*(\d+)',
    'Kitchen': r'Kitchens?\s*(\d+)',
    'Lounge': r'Lounges?\s*(\d+)',
    'Dining Room': r'Dining Room(\d+)',
    'Parking': r'Parking(\d+)',
    #'Special Features': r'Special Features(\d+)',
    #'Backup Water': r'Backup Water(Yes|No)',
    'Erf Size (m²)': r'Erf Size\s*(\d+)\s*m²',
    'Floor Area (m²)': r'Floor Area\s*(\d+)\s*m²',
    'Garden': r'Gardens?\s*(\d+)',
    'Pool': r'Pools?\s*(\d+)',
    'Garage': r'Garages?\s*(\d+)'
}



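# Function to extract features from text
def extract_features(text):
    features = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        # Keep the captured value, or None when the pattern is absent
        features[field] = (match.group(1).strip() or None) if match else None
    return features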
    
# Apply extraction row-wise
extracted_df = rental_listings['property_features'].apply(extract_features).apply(pd.Series)

# Combine original df with the extracted columns
results_df = pd.concat([rental_listings, extracted_df], axis=1)

# View result
print(results_df)

                price                        location  \
0       R7000PerMonth           Bhisho, Tyu Tyu North   
1       R7000PerMonth          Bhisho, Bhisho Central   
2       R5300PerMonth           Bhisho, Tyu Tyu North   
3       R7500PerMonth              Bhisho, Bisho Park   
4       R5500PerMonth              Bhisho, Bisho Park   
...               ...                             ...   
13924  R12770PerMonth            Polokwane, Moregloed   
13925  R70631PerMonth  Polokwane, Polokwane Industria   
13926        R72Perm²              Polokwane, Nirvana   
13927  R19200PerMonth    Polokwane, Polokwane Central   
13928  R13500PerMonth               Polokwane, Bendor   

                                          property_title  \
0               3 Bedroom House to Rent in Tyu Tyu North   
1              3 Bedroom House to Rent in Bhisho Central   
2               2 Bedroom House to Rent in Tyu Tyu North   
3                  3 Bedroom House to Rent in Bisho Park   
4              
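To see what the pattern dictionary produces for a single listing, the sketch below runs a subset of the patterns on one feature string. The string is made up: only its leading "Property OverviewListing Number..." shape follows the data shown above, and `sample_patterns` and `extract_sample_features` are hypothetical names.

```python
import re

# Subset of the patterns defined above
sample_patterns = {
    "Listing Number": r"Listing Number(\d+)",
    "Type of Property": r"Type of Property\s*(House|Apartment / Flat|Townhouse)",
    "Bedrooms": r"Bedrooms?\s*(\d+)",
    "Pool": r"Pools?\s*(\d+)",
}


def extract_sample_features(text):
    found = {}
    for field, pattern in sample_patterns.items():
        match = re.search(pattern, text)
        found[field] = match.group(1) if match else None
    return found


# Made-up feature string with labels and values run together
text = (
    "Property OverviewListing Number115295115"
    "Type of PropertyHouseBedrooms3Bathrooms2"
)
features = extract_sample_features(text)
print(features)
```

Every extracted value is a string (or `None` when the label is absent), which is why numeric columns such as `Parking` and `Bedrooms` need `pd.to_numeric` afterwards.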

In [40]:
results_df['Erf Size (m²)'].value_counts()

Erf Size (m²)
495    68
500    41
250    39
100    34
800    34
       ..
922     1
689     1
877     1
474     1
416     1
Name: count, Length: 716, dtype: int64

In [None]:
results_df.head(20)

In [None]:
results_df['Parking'].value_counts()

In [None]:
# Set display options
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_colwidth', None)  # Show full content in each cell
pd.set_option('display.width', None)         # Prevent line wrapping

In [None]:
results_df["Parking"] = pd.to_numeric(results_df["Parking"], errors='coerce')
results_df[results_df["Parking"] == 3].sort_values("Parking")

In [None]:
results_df['Garage'].value_counts()

In [None]:
def extract_bedrooms_from_property_title(title):
    if pd.isna(title):
        return None
    match = re.search(r'(\d+)[-\s]?bedroom', title.lower())
    return int(match.group(1)) if match else None

# fillna only touches rows where Bedrooms is missing, so no row-wise check is needed
results_df['Bedrooms'] = results_df['Bedrooms'].fillna(
    results_df['property_title'].apply(extract_bedrooms_from_property_title))

# Optional: convert to integer if needed
results_df['Bedrooms'] = pd.to_numeric(results_df['Bedrooms'], errors='coerce').astype('Int64')
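The title fallback can also be expressed with `str.extract` instead of a Python function. A minimal sketch on a made-up frame, assuming the same "N Bedroom" title convention; the third title is invented for illustration.

```python
import re

import pandas as pd

sample = pd.DataFrame({
    "property_title": [
        "3 Bedroom House to Rent in Bisho Park",
        "House to Rent in Bisho Park",
        "2 Bedroom Apartment to Rent in Woodstock",  # invented title
    ],
    "Bedrooms": ["3", None, None],
})

# Digits directly before "bedroom", matched case-insensitively
from_title = sample["property_title"].str.extract(
    r"(\d+)[-\s]?bedroom", flags=re.IGNORECASE, expand=False)

# Only rows where Bedrooms is missing take the value found in the title
sample["Bedrooms"] = pd.to_numeric(
    sample["Bedrooms"].fillna(from_title), errors="coerce").astype("Int64")
print(sample["Bedrooms"].tolist())
```

Titles without a bedroom count, like the second row, stay missing; the nullable `Int64` dtype lets the column hold integers alongside those gaps.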

In [None]:
results_df['Bedrooms'].isnull().sum()

In [None]:
results_df.info()

In [None]:
results_df.head(20)

In [None]:
results_df['Furnished'].value_counts()

In [None]:
results_df['Type of Property'].value_counts()

### Listings for sale

In [None]:
listings_for_sale = pd.read_csv("../data/raw/listings_for_sale_in_capital_cities.csv")
display("Listings for sale", listings_for_sale.head(), listings_for_sale.shape)

In [None]:
# Get information about the listings_for_sale DataFrame
listings_for_sale.info()

In [None]:
# See the number of null values in each column
listings_for_sale.isnull().sum()

In [None]:
# Drop listings where the property_title is null
listings_for_sale = listings_for_sale.dropna(subset=['property_title']).reset_index(drop=True)

In [None]:
listings_for_sale_duplicates = listings_for_sale[listings_for_sale.duplicated()]
num_listings_for_sale_duplicates = len(listings_for_sale_duplicates)
print(f"There are {num_listings_for_sale_duplicates} duplicates in the listings-for-sale dataset")

In [None]:
# Drop duplicate rows and keep the first occurrence
listings_for_sale = listings_for_sale.drop_duplicates().reset_index(drop=True)

In [None]:
print(listings_for_sale['price'].str.contains(
    'POA', case=False, na=False).sum())

In [None]:
# Disabled: remove rows where 'price' contains 'POA' (case-insensitive)
# listings_for_sale = listings_for_sale[~listings_for_sale['price'].astype(
#     str).str.contains('POA', case=False, na=False)]
# listings_for_sale.reset_index(drop=True, inplace=True)

In [None]:
listings_for_sale['price_cleaned'] = listings_for_sale['price'].str.extract(
    r'(R\s?\d[\d\s]*)')

listings_for_sale['price_cleaned'] = listings_for_sale['price_cleaned'].str.replace(
    r'\s+', '', regex=True)

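listings_for_sale['price (in rands)'] = listings_for_sale['price_cleaned'].str.extract(
    r'R?(\d+)', expand=False)
# Convert to a numeric dtype, as done for the rental listings
listings_for_sale['price (in rands)'] = pd.to_numeric(
    listings_for_sale['price (in rands)'], errors='coerce')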

# Print out the first five rows to see the results
listings_for_sale.head()

In [None]:
# Extract the suburb from the property_title column
listings_for_sale['suburb'] = listings_for_sale['property_title'].str.extract(
    r'for Sale in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
listings_for_sale['suburb'] = (
    listings_for_sale['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
listings_for_sale.head()

In [None]:
# Apply the extract_city_from_location function to extract the city
listings_for_sale['city'] = listings_for_sale['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

In [None]:
listings_for_sale['city'].isnull().sum()

In [None]:
# Create a mapping of suburb to known cities
suburb_to_city_map = listings_for_sale[listings_for_sale['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
listings_for_sale['city'] = listings_for_sale.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)

In [None]:
# Count the number of listings per city
listings_for_sale['city'].value_counts()

In [None]:
listings_for_sale.isnull().sum()

In [None]:
# Drop the rows where the city column is empty
listings_for_sale = listings_for_sale.dropna(subset=['city'])

In [None]:
listings_for_sale.isnull().sum()

In [None]:
# Merge to get the matching province for the listings
listings_for_sale = listings_for_sale.merge(city_province, on='city', how='left')

In [None]:
listings_for_sale.isnull().sum()

In [None]:
poa_mask = listings_for_sale['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

print(f"There are {num_poa} listings with 'POA' in the price column.")