# <center> 🧽 Data Cleaning </center>

## Table of Contents

- [1. Introduction](#1-introduction)
- [2. Loading Data](#2-loading-data)

## 1. Introduction
[Back to Table of Contents](#table-of-contents)

---

| ⚡ Description: Introducing the project ⚡ |
| :--------------------------- |

---

<a id="two"></a>
## 2. Loading Data
[Back to Table of Contents](#table-of-contents)

---
    
| ⚡ Description: Loading the data in our environment ⚡ |
| :--------------------------- |

---

### Import packages

In [1]:
# Import packages
import re

import numpy as np
import pandas as pd

### Rental Properties

In [2]:
# Load the rental listings dataset
rental_listings = pd.read_csv("../data/raw/rental_listings_for_capital_cities.csv")

# Display the first few rows of the dataset and its shape
display("Rentals", rental_listings.head(), rental_listings.shape)

'Rentals'

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...
3,,,,,,house-to-rent-in-bisho-park-101731048
4,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490


(15324, 6)

In [3]:
# Print a summary of the dataset
rental_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15324 entries, 0 to 15323
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   price                 15323 non-null  object
 1   location              8189 non-null   object
 2   property_title        15323 non-null  object
 3   property_description  15323 non-null  object
 4   property_features     15323 non-null  object
 5   listing_slug          15324 non-null  object
dtypes: object(6)
memory usage: 718.4+ KB


In [4]:
# Print the total number of null values in the dataset
rental_null_values = rental_listings.isnull().sum().sum()
print(f"There are {rental_null_values} null values in the rental dataset")

There are 7139 null values in the rental dataset


In [5]:
# Print the number of null values in each column
rental_listings.isnull().sum()

price                      1
location                7135
property_title             1
property_description       1
property_features          1
listing_slug               0
dtype: int64

In [6]:
rental_listings[rental_listings.isnull().sum(axis=1) >= 4]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug
3,,,,,,house-to-rent-in-bisho-park-101731048


In [7]:
rental_listings.drop(
    rental_listings[rental_listings.isnull().sum(axis=1) >= 2].index, inplace=True)
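
The filter above counts missing values per row with `isnull().sum(axis=1)` and drops every row that has two or more. A minimal sketch on a toy frame (the values are invented for illustration) shows which rows the threshold removes, and that `dropna(thresh=...)` gives the same result when the threshold is expressed as a minimum number of non-null values:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "price": ["R 7 000Per Month", np.nan, "R 5 300Per Month"],
    "location": ["Bhisho, Tyu Tyu North", np.nan, np.nan],
    "listing_slug": ["slug-a", "slug-b", "slug-c"],
})

# Number of missing values in each row
nulls_per_row = toy.isnull().sum(axis=1)

# Keep rows with fewer than two missing values (same effect as the drop above)
kept = toy[nulls_per_row < 2]

# Equivalent built-in: keep rows with at least (n_columns - 1) non-null values
kept_thresh = toy.dropna(thresh=toy.shape[1] - 1)

print(kept)
```

Because `location` is missing in many rows, a threshold of one missing value would discard far more listings than a threshold of two.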

In [8]:
# Get the total number of duplicate rows in the dataset
rental_duplicates = rental_listings[rental_listings.duplicated()]
num_rental_duplicates = len(rental_duplicates)
print(f"There are {num_rental_duplicates} duplicates in the rental dataset ")

There are 1303 duplicates in the rental dataset 


In [9]:
# Drop duplicate rows and keep the first occurrence
rental_listings = rental_listings.drop_duplicates().reset_index(drop=True)

#### Property features extraction

In [10]:
# Remove any information following the text 'Points of Interest' in the 'property_features' column
rental_listings["property_features"] = rental_listings["property_features"].str.replace(
    r'Points of Interest.*', '', regex=True
)

rental_listings["property_features"] = rental_listings["property_features"].str.strip()
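
One detail of the pattern above is that `.` does not match a newline by default, so if a feature string contains a line break after `Points of Interest`, the text after the break would survive. Whether the scraped column contains line breaks is not shown here, so the sketch below uses made-up strings to illustrate the difference and the optional `flags=re.DOTALL` argument:

```python
import re
import pandas as pd

# Made-up feature strings; the second one contains a line break
features = pd.Series([
    "Listing Number123Bedrooms2Points of InterestSchool 1.2km",
    "Listing Number456Bedrooms3Points of InterestShops\nHospital 3km",
])

# Default behaviour: '.' stops at the newline, so the last line is kept
default = features.str.replace(r"Points of Interest.*", "", regex=True).str.strip()

# With re.DOTALL, '.' also matches newlines and everything after the marker goes
dotall = features.str.replace(
    r"Points of Interest.*", "", regex=True, flags=re.DOTALL
).str.strip()

print(default.tolist())
print(dotall.tolist())
```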

In [11]:
patterns = {
    'Listing Number': r'Listing Number(\d+)',
    'Type of Property': r'Type of Property\s*(House|Apartment / Flat|Townhouse|Commercial|Industrial|Vacant Land|Farm)',
    #'Description': r'Description([A-Za-z]+)',
    #'Lifestyle': r'Lifestyle([A-Za-z]+)',
    #'Occupation Date': r'Occupation Date([\d\w\s]+?)Deposit Requirements',
    #'Deposit Amount': r'Deposit Amount:\s*R\s?([\d,]+)',
    'Pets Allowed': r'Pets Allowed(Yes|No)',
    'Furnished': r'Furnished(Yes|No)',
    'Bedrooms': r'Bedrooms?\s*(\d+)',
    'Bathrooms': r'Bathrooms?\s*(\d+)',
    'Kitchen': r'Kitchens?\s*(\d+)',
    'Lounge': r'Lounges?\s*(\d+)',
    'Dining Room': r'Dining Room(\d+)',
    'Parking': r'Parking(\d+)',
    #'Special Features': r'Special Features(\d+)',
    #'Backup Water': r'Backup Water(Yes|No)',
    'Erf Size (m²)': r'Erf Size\s*(\d+)\s*m²',
    'Floor Area (m²)': r'Floor Area\s*(\d+)\s*m²',
    'Garden': r'Gardens?\s*(\d+)',
    'Pool': r'Pools?\s*(\d+)',
    'Garage': r'Garages?\s*(\d+)'
}



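# Function to extract features from text
def extract_features(text):
    """Return a dict mapping each field in `patterns` to its first captured value, or None."""
    features = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        features[field] = match.group(1).strip() if match else None
    return features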
    
# Apply extraction row-wise
extracted_df = rental_listings['property_features'].apply(extract_features).apply(pd.Series)

# Combine original df with the extracted columns
results_df = pd.concat([rental_listings, extracted_df], axis=1)

# View result
results_df.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,Listing Number,Type of Property,Pets Allowed,Furnished,...,Bathrooms,Kitchen,Lounge,Dining Room,Parking,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,115295115,House,No,No,...,1.0,1.0,1.0,1.0,1.0,,,,,
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,115332086,House,No,No,...,1.0,1.0,1.0,,1.0,,,1.0,,
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,115210602,House,Yes,No,...,1.0,1.0,1.0,,1.0,,,1.0,,
3,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,115555490,House,Yes,No,...,1.0,1.0,,,3.0,,,,,
4,R 5 500Per Month,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,115525482,House,,,...,,,,,,,,,,
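
To make the extraction step easier to follow, here is a small sketch that applies a reduced copy of the pattern table to one made-up feature string. The names `sample_patterns`, `sample_text` and `extract_sample_features` are hypothetical and exist only for this illustration; the string imitates the run-together style of the scraped column.

```python
import re

# A reduced copy of the pattern table, for illustration only
sample_patterns = {
    "Listing Number": r"Listing Number(\d+)",
    "Pets Allowed": r"Pets Allowed(Yes|No)",
    "Bedrooms": r"Bedrooms?\s*(\d+)",
    "Pool": r"Pools?\s*(\d+)",
}

# Made-up feature text with labels and values run together
sample_text = "Property OverviewListing Number115295115Pets AllowedNoBedrooms3Bathrooms1"


def extract_sample_features(text):
    """Return the first captured value per field, or None when the label is absent."""
    out = {}
    for field, pattern in sample_patterns.items():
        match = re.search(pattern, text)
        out[field] = match.group(1) if match else None
    return out


sample_result = extract_sample_features(sample_text)
print(sample_result)
```

A field whose label never appears in the text (here `Pool`) comes back as `None`, which is why many extracted columns are mostly empty.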


In [12]:
results_df["Bedrooms"].isnull().sum()

np.int64(11068)

In [13]:
def extract_bedrooms_from_property_title(title):
    if pd.isna(title):
        return None
    match = re.search(r'(\d+)[-\s]?bedroom', title.lower())
    return int(match.group(1)) if match else None

# Fill missing bedroom counts with the value parsed from the property title
results_df['Bedrooms'] = results_df['Bedrooms'].fillna(
    results_df['property_title'].apply(extract_bedrooms_from_property_title))

# Optional: convert to integer if needed
results_df['Bedrooms'] = pd.to_numeric(results_df['Bedrooms'], errors='coerce').astype('Int64')

In [14]:
results_df["Bedrooms"].isnull().sum()

np.int64(8521)
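
The title fallback only helps when the title actually states a bedroom count. The sketch below runs the same regular expression on a few titles (two copied from the preview above, one made up) to show which ones yield a number; `bedrooms_from_title` is a stand-alone copy of the helper so the example runs on its own.

```python
import re
import pandas as pd


def bedrooms_from_title(title):
    """Parse a leading bedroom count such as '3 Bedroom' or '2-bedroom' from a title."""
    if pd.isna(title):
        return None
    match = re.search(r"(\d+)[-\s]?bedroom", title.lower())
    return int(match.group(1)) if match else None


titles = pd.Series([
    "3 Bedroom House to Rent in Tyu Tyu North",       # count present
    "House to Rent in Bisho Park",                    # no count in the title
    "2-bedroom Apartment / Flat to Rent in Clifton",  # made-up hyphenated variant
    None,                                             # missing title
])

parsed = titles.apply(bedrooms_from_title)
print(parsed)
```

Titles without a count, such as commercial or industrial listings, stay missing, which is consistent with the large number of nulls that remain after the fill.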

####  Suburb, city and province extraction

In [None]:
# Extract the suburb from the property_title column
results_df['suburb'] = results_df['property_title'].str.extract(
    r'to Rent in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
results_df['suburb'] = (
    results_df['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Drop the property_title column as we have extracted all the information we need
results_df.drop(columns=['property_title'], inplace=True)

# Print the first five rows of the table to see the results
results_df.head()

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,Listing Number,Type of Property,Pets Allowed,Furnished,...,Kitchen,Lounge,Dining Room,Parking,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage,suburb
0,R 7 000Per Month,"Bhisho, Tyu Tyu North",3 Bedroom House to Rent in Tyu Tyu North,RE/MAX Border is proud to present this 3-bedro...,Property OverviewListing Number115295115Type o...,3-bedroom-house-to-rent-in-tyu-tyu-north-11529...,115295115,House,No,No,...,1.0,1.0,1.0,1.0,,,,,,Tyu Tyu North
1,R 7 000Per Month,"Bhisho, Bhisho Central",3 Bedroom House to Rent in Bhisho Central,How CONVENIENT is this flat directly opposite ...,Property OverviewListing Number115332086Type o...,3-bedroom-house-to-rent-in-bhisho-central-1153...,115332086,House,No,No,...,1.0,1.0,,1.0,,,1.0,,,Bhisho Central
2,R 5 300Per Month,"Bhisho, Tyu Tyu North",2 Bedroom House to Rent in Tyu Tyu North,Comfortable Home for a family which is availab...,Property OverviewListing Number115210602Type o...,2-bedroom-house-to-rent-in-tyu-tyu-north-11521...,115210602,House,Yes,No,...,1.0,1.0,,1.0,,,1.0,,,Tyu Tyu North
3,R 7 500Per Month,"Bhisho, Bisho Park",3 Bedroom House to Rent in Bisho Park,BetterProp is thrilled to present this well-ma...,Property OverviewListing Number115555490Type o...,3-bedroom-house-to-rent-in-bisho-park-115555490,115555490,House,Yes,No,...,1.0,,,3.0,,,,,,Bisho Park
4,R 5 500Per Month,"Bhisho, Bisho Park",House to Rent in Bisho Park,This neat family home boasts:3 bedroomsOpen pl...,Property OverviewListing Number115525482Type o...,house-to-rent-in-bisho-park-115525482,115525482,House,,,...,,,,,,,,,,Bisho Park


In [16]:
# Load the CSV of capital cities and their provinces, then build a list of the known city names
city_province = pd.read_csv("../data/raw/south_africa_provinces_capitals.csv")
city_list = city_province['city'].dropna().str.strip().unique()

In [17]:
def extract_city_from_location(location, city_list):
    if pd.isna(location):
        return None
    
    
    location_lower = location.lower()
    for city in city_list:
        if city.lower() in location_lower:
            return city
    return None
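
The function does a case-insensitive substring search and returns the first city in the list that appears anywhere in the location text. A short sketch with an illustrative subset of cities (`known_cities` and `find_city` are hypothetical names used only here) shows the behaviour for matching, non-matching and missing locations:

```python
import pandas as pd

# Illustrative subset; the notebook reads the full list from a CSV file
known_cities = ["Cape Town", "Johannesburg", "Bhisho"]


def find_city(location, cities):
    """Return the first known city contained in the location string, else None."""
    if pd.isna(location):
        return None
    location_lower = location.lower()
    for city in cities:
        if city.lower() in location_lower:
            return city
    return None


examples = [
    "Cape Town, Clifton",
    "B11-2 - Lower Germiston Road, Heriotdale, Johannesburg",
    "Nelspruit, Nelsville",  # not in the illustrative subset
    None,
]
matches = [find_city(loc, known_cities) for loc in examples]
print(matches)
```

Since the first match wins, the order of the city list matters whenever one city name is contained in another.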

In [None]:
# Apply the extract_city_from_location function to extract the city from the location column
results_df['city'] = results_df['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

# Drop the location column as it is redundant
results_df.drop(columns=["location"], inplace=True)

In [19]:
# Create a mapping of suburb to known cities
suburb_to_city_map = results_df[results_df['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
results_df['city'] = results_df.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)

In [20]:
results_df['city'].isnull().sum()

np.int64(91)
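
The idea behind the fill is that a suburb seen with a known city elsewhere in the data can lend that city to rows where it is missing. The sketch below reproduces this on a toy frame using `Series.map` and `fillna`, which is one possible vectorised alternative to the row-wise `apply` above (the toy rows are made up):

```python
import pandas as pd

# Toy rows: two suburbs have a known city somewhere, one never does
toy = pd.DataFrame({
    "suburb": ["Clifton", "Clifton", "Hilton", "Hilton", "Westgate"],
    "city": ["Cape Town", None, "Bloemfontein", None, None],
})

# Build the suburb -> city lookup from rows where the city is known
lookup = (
    toy[toy["city"].notna()]
    .drop_duplicates(subset=["suburb", "city"])
    .set_index("suburb")["city"]
    .to_dict()
)

# Fill only the missing cities; suburbs absent from the lookup stay missing
toy["city_filled"] = toy["city"].fillna(toy["suburb"].map(lookup))
print(toy)
```

One caveat of the dictionary approach: if the same suburb name occurs under two different cities, `to_dict()` keeps only the last pairing, so ambiguous suburbs are assigned to a single city.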

In [21]:
# Drop the city rows with null values
results_df = results_df.dropna(subset=['city'])

In [22]:
results_df['city'].isnull().sum()

np.int64(0)

In [23]:
# Merge the city with its matching province
results_df = results_df.merge(city_province, on='city', how='left')
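
A left merge keeps every listing and attaches the province where the city matches. As a hedged sketch, the `validate` and `indicator` options of `DataFrame.merge` can be used to check two assumptions: that each city appears only once in the lookup table, and that every listing found a province. The frames below are toy stand-ins for `results_df` and `city_province`.

```python
import pandas as pd

# Toy stand-ins for the listings and the city/province lookup
listings = pd.DataFrame({
    "suburb": ["Clifton", "Hilton"],
    "city": ["Cape Town", "Bloemfontein"],
})
provinces = pd.DataFrame({
    "city": ["Cape Town", "Bloemfontein"],
    "province": ["Western Cape", "Free State"],
})

# validate raises MergeError if a city is duplicated in `provinces`;
# indicator adds a `_merge` column showing whether each row found a match
merged = listings.merge(
    provinces, on="city", how="left", validate="many_to_one", indicator=True
)
print(merged)
```

This only works as intended if the `city` values are spelled identically in both frames, including surrounding whitespace.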

#### Price column

In [24]:
results_df['price_type'] = results_df['price'].str.extract(
    r'(Per\s*Month|Per\s*Day|m2|m²|Perm²)', flags=re.IGNORECASE, expand=False
)


# Normalize the price_type text (remove spaces, lowercase), then map it to a standard form
results_df['price_type'] = results_df['price_type'].str.replace(
    r'\s+', '', regex=True).str.lower()
results_df['price_type'] = results_df['price_type'].replace({
    'permonth': 'Per Month',
    'perday': 'Per Day',
    'm2': 'Per m²',
    'm²': 'Per m²',
    'perm²': 'Per m²'
})

# Remove spaces inside the price string first
results_df['price_clean'] = results_df['price'].str.replace(
    r'\s+', '', regex=True)

# Now extract the numeric part (digits only, continuous)
results_df['price (in rands)'] = results_df['price_clean'].str.extract(
    r'R?(\d+)', expand=False)

# Convert to numeric
results_df['price (in rands)'] = pd.to_numeric(
    results_df['price (in rands)'], errors='coerce')

# Drop the helper column
results_df.drop(columns=['price_clean'], inplace=True)

# Treat blank or whitespace-only price_type entries as missing
results_df['price_type'] = results_df['price_type'].replace(
    r'^\s*$', pd.NA, regex=True)

# Preview results
print(results_df[['price', 'price_type', 'price (in rands)']].head())

              price price_type  price (in rands)
0  R 7 000Per Month  Per Month            7000.0
1  R 7 000Per Month  Per Month            7000.0
2  R 5 300Per Month  Per Month            5300.0
3  R 7 500Per Month  Per Month            7500.0
4  R 5 500Per Month  Per Month            5500.0
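
The price cell does two things: it reads the billing unit from the text after the amount, and it reads the amount itself once the spaces are removed. A compact sketch on four sample strings (formats taken from the outputs in this notebook, the `R 120` amount is made up) shows how priced and `POA` entries come out:

```python
import re
import pandas as pd

prices = pd.Series(["R 7 000Per Month", "R 120Per m²", "POAPer Day", "POA"])

# Billing unit: extract, strip spaces, lowercase, then map to a standard label
price_type = (
    prices.str.extract(r"(Per\s*Month|Per\s*Day|m2|m²)", flags=re.IGNORECASE, expand=False)
    .str.replace(r"\s+", "", regex=True)
    .str.lower()
    .replace({"permonth": "Per Month", "perday": "Per Day", "m2": "Per m²", "m²": "Per m²"})
)

# Amount: remove spaces, take the first run of digits, convert to a number
amount = pd.to_numeric(
    prices.str.replace(r"\s+", "", regex=True).str.extract(r"R?(\d+)", expand=False),
    errors="coerce",
)

print(pd.DataFrame({"price": prices, "price_type": price_type, "amount": amount}))
```

`POA` entries contain no digits, so their amount is `NaN`. Note that a unit written as `m2` does contain a digit, so a string such as `POAPer m2` would be read as the amount 2 by this pattern; the outputs shown here use `m²`, which `\d` does not match.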


In [25]:
# View rows where price (in rands) is still null
null_price_rows = results_df[results_df['price (in rands)'].isna()]
null_price_rows[['price', 'price_type', 'suburb', 'Type of Property']].sample(10)

Unnamed: 0,price,price_type,suburb,Type of Property
4370,POAPer Day,Per Day,Clifton,House
12652,POAPer m²,Per m²,Nelspruit Central,Commercial
6653,POAPer Month,Per Month,Heriotdale,Industrial
5027,POA,,Llandudno,House
13645,POAPer Month,Per Month,Nirvana Industrial,Commercial
3169,POAPer Month,Per Month,Green Point,Apartment / Flat
5010,POAPer Day,Per Day,Clifton,House
4392,POAPer Month,Per Month,Claremont,Industrial
4123,POAPer Month,Per Month,Epping Industrial,Industrial
8031,POAPer Month,Per Month,Village Main,Industrial


In [26]:
results_df['price_type'].value_counts()

price_type
Per Month    12826
Per m²         948
Per Day        143
Name: count, dtype: int64

In [27]:
results_df['price_type'].isnull().sum()

np.int64(12)

In [28]:
results_df[results_df['price_type'].isna()]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,Listing Number,Type of Property,Pets Allowed,Furnished,...,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage,suburb,city,province,price_type,price (in rands)
357,POA,,Industrial Property to Rent in Hilton,"Unit 1 within 29 Grey Street, Hilton - FS, Blo...",Property OverviewListing Number114774738Type o...,industrial-property-to-rent-in-hilton-114774738,114774738,Industrial,,No,...,,,,,,Hilton,Bloemfontein,Free State,,
3728,POA,"Cape Town, Bantry Bay",4 Bedroom House to Rent in Bantry Bay,"Salty ocean breezes, golden sunsets, breathtak...",Property OverviewListing Number115605901Type o...,4-bedroom-house-to-rent-in-bantry-bay-115605901,115605901,House,No,Yes,...,,,1.0,1.0,1.0,Bantry Bay,Cape Town,Western Cape,,
3760,POA,"Cape Town, Clifton",5 Bedroom House to Rent in Clifton,Peering over the edge of Africa from its cliff...,Property OverviewListing Number115594664Type o...,5-bedroom-house-to-rent-in-clifton-115594664,115594664,House,No,Yes,...,,,,1.0,1.0,Clifton,Cape Town,Western Cape,,
3771,POA,"Cape Town, Fresnaye",5 Bedroom House to Rent in Fresnaye,Drawing inspiration from the whitewashed villa...,Property OverviewListing Number115613951Type o...,5-bedroom-house-to-rent-in-fresnaye-115613951,115613951,House,,Yes,...,,,,1.0,1.0,Fresnaye,Cape Town,Western Cape,,
3837,POA,"Cape Town, Clifton",4 Bedroom House to Rent in Clifton,"With its gaze fixed to the horizon, this beach...",Property OverviewListing Number115606187Type o...,4-bedroom-house-to-rent-in-clifton-115606187,115606187,House,No,Yes,...,,,,1.0,2.0,Clifton,Cape Town,Western Cape,,
4180,POA,"Cape Town, Llandudno",5 Bedroom House to Rent in Llandudno,An exceptional and very glamorous Llandudno vi...,Property OverviewListing Number108778518Type o...,5-bedroom-house-to-rent-in-llandudno-108778518,108778518,House,,Yes,...,991.0,862.0,,,,Llandudno,Cape Town,Western Cape,,
4587,POA,"Cape Town, Camps Bay",5 Bedroom House to Rent in Camps Bay,*Please note that the daily rate ranges from R...,Property OverviewListing Number115118688Type o...,5-bedroom-house-to-rent-in-camps-bay-115118688,115118688,House,No,Yes,...,530.0,,,,,Camps Bay,Cape Town,Western Cape,,
4610,POA,"Cape Town, Clifton",5 Bedroom House to Rent in Clifton,Clifton beach bungalow with entertainment area...,Property OverviewListing Number115153896Type o...,5-bedroom-house-to-rent-in-clifton-115153896,115153896,House,No,Yes,...,,463.0,,,,Clifton,Cape Town,Western Cape,,
4896,POA,"Cape Town, Llandudno",8 Bedroom House to Rent in Llandudno,A beautifully crafted furnished home with the ...,Property OverviewListing Number110813037Type o...,8-bedroom-house-to-rent-in-llandudno-110813037,110813037,House,,Yes,...,,605.0,,,,Llandudno,Cape Town,Western Cape,,
4899,POA,"Cape Town, Constantia",7 Bedroom House to Rent in Constantia,"Welcome to this Constantia homestead, steeped ...",Property OverviewListing Number113953599Type o...,7-bedroom-house-to-rent-in-constantia-113953599,113953599,House,,Yes,...,,,,,,Constantia,Cape Town,Western Cape,,


Identify the listings whose price is POA (price on application), and separate those that still include a price type from those that do not.

In [30]:
# Count the rows whose 'price' value contains 'POA' (case-insensitive)
poa_mask = results_df['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

# Print the total number of rows
print(f"There are {num_poa} listings with 'POA' in the price column.")

There are 217 listings with 'POA' in the price column.


In [31]:
# Identify POA rows with valid price_type
poa_mask = results_df['price'].str.lower().str.startswith('poa')
poa_with_type_mask = poa_mask & results_df['price_type'].notna() & (results_df['price_type'] != '')

# Calculate the average price per suburb, Type of Property and price_type
avg_prices = (
    results_df[~poa_mask & results_df['price (in rands)'].notna()]
    .groupby(['suburb', 'Type of Property', 'price_type'])['price (in rands)']
    .mean()
    .reset_index()
    .rename(columns={'price (in rands)': 'price_avg'})
)
    
# Merge the averages into the original DataFrame
results_df = results_df.merge(avg_prices, on=['suburb', 'Type of Property', 'price_type'], how='left')

# Replace the price of POA listings that have a price type with the group average
results_df.loc[poa_with_type_mask, 'price (in rands)'] = results_df.loc[
    poa_with_type_mask, 'price_avg']

# Drop the POA rows without a price type
results_df = results_df[~(poa_mask & ~poa_with_type_mask)]

# Drop the helper column (reassign instead of modifying the filtered frame in place)
results_df = results_df.drop(columns=['price_avg'])

In [32]:
print("Remaining 'POA':", results_df['price'].str.lower().eq('poa').sum())
print("Remaining nulls in 'price (in rands)':",
      results_df['price (in rands)'].isna().sum())

Remaining 'POA': 0
Remaining nulls in 'price (in rands)': 12
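
The remaining nulls follow from how the averages are built: a POA listing can only receive an average if at least one priced listing shares its suburb, property type and price type. The sketch below shows the same idea on toy data using `groupby(...).transform("mean")`, which is an alternative to the merge used above rather than the notebook's own method:

```python
import numpy as np
import pandas as pd

# Toy data: the Clifton group has known prices, the Fresnaye group has none
toy = pd.DataFrame({
    "suburb": ["Clifton", "Clifton", "Clifton", "Fresnaye"],
    "Type of Property": ["House", "House", "House", "House"],
    "price_type": ["Per Month", "Per Month", "Per Month", "Per Day"],
    "price (in rands)": [20000.0, 30000.0, np.nan, np.nan],
})

group_cols = ["suburb", "Type of Property", "price_type"]

# Mean of the known prices in each group, broadcast back to every row of the group
group_mean = toy.groupby(group_cols)["price (in rands)"].transform("mean")

# Fill only the missing prices; groups with no known price stay NaN
toy["price_filled"] = toy["price (in rands)"].fillna(group_mean)
print(toy)
```

The Fresnaye row stays empty because its group has no priced listing to average, which mirrors the rows listed in the next cell.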


In [33]:
results_df[results_df['price (in rands)'].isna()]

Unnamed: 0,price,location,property_title,property_description,property_features,listing_slug,Listing Number,Type of Property,Pets Allowed,Furnished,...,Erf Size (m²),Floor Area (m²),Garden,Pool,Garage,suburb,city,province,price_type,price (in rands)
4246,POAPer Day,"Cape Town, Fresnaye",4 Bedroom Apartment / Flat to Rent in Fresnaye,Positioned in the prestigious Fresnaye neighbo...,Property OverviewListing Number115374518Type o...,4-bedroom-apartment-flat-to-rent-in-fresnaye-1...,115374518,Apartment / Flat,,,...,,,,1.0,,Fresnaye,Cape Town,Western Cape,Per Day,
4586,POAPer Day,"Cape Town, Zwaanswyk",7 Bedroom House to Rent in Zwaanswyk,Escape to Paradise...Welcome to a world of enc...,Property OverviewListing Number112682529Type o...,7-bedroom-house-to-rent-in-zwaanswyk-112682529,112682529,House,,Yes,...,,,,,,Zwaanswyk,Cape Town,Western Cape,Per Day,
4836,POAPer Day,"Cape Town, Bakoven",3 Bedroom Apartment / Flat to Rent in Bakoven,This magnificent penthouse located in Barley B...,Property OverviewListing Number111244618Type o...,3-bedroom-apartment-flat-to-rent-in-bakoven-11...,111244618,Apartment / Flat,,,...,,286.0,,1.0,,Bakoven,Cape Town,Western Cape,Per Day,
4863,POAPer Day,"Cape Town, Fresnaye",4 Bedroom Apartment / Flat to Rent in Fresnaye,Discover Luxury Living in the Heart of Cape To...,Property OverviewListing Number114968205Type o...,4-bedroom-apartment-flat-to-rent-in-fresnaye-1...,114968205,Apartment / Flat,,,...,,,,1.0,,Fresnaye,Cape Town,Western Cape,Per Day,
4991,POAPer Day,"Cape Town, Bantry Bay",3 Bedroom Apartment / Flat to Rent in Bantry Bay,*Available 01 May 2025 - 31 August 2025*If it ...,Property OverviewListing Number111859212Type o...,3-bedroom-apartment-flat-to-rent-in-bantry-bay...,111859212,Apartment / Flat,,Yes,...,,,,1.0,,Bantry Bay,Cape Town,Western Cape,Per Day,
7450,POAPer m²,"Aa10 - Cnr Atlas & Paul Smit Street, Heriotdal...",Commercial Property to Rent in Heriotdale,1.\tOffices with large windows2.\tKitchenette3...,Property OverviewListing Number115523149Type o...,commercial-property-to-rent-in-heriotdale-1155...,115523149,Commercial,,,...,,235.0,,,,Heriotdale,Johannesburg,Gauteng,Per m²,
8317,POAPer m²,"Aa13-2 - Cnr Atlas & Paul Smit Street, Heriotd...",Commercial Property to Rent in Heriotdale,1.\tAmple Parking2.\tOffices with large window...,Property OverviewListing Number115418255Type o...,commercial-property-to-rent-in-heriotdale-1154...,115418255,Commercial,,,...,,260.0,,,,Heriotdale,Johannesburg,Gauteng,Per m²,
8706,POAPer Month,,Industrial Property to Rent in Westgate,Discover this spacious 500 square meter proper...,Property OverviewListing Number114459245Type o...,industrial-property-to-rent-in-westgate-114459245,114459245,Industrial,,No,...,500.0,500.0,,,,Westgate,Johannesburg,Gauteng,Per Month,
8758,POAPer m²,"B11-2 - Lower Germiston Road, Heriotdale, Joha...",Commercial Property to Rent in Heriotdale,1.\tOffices2.\tAmple Kitchenette3.\tAmple Ablu...,Property OverviewListing Number115517306Type o...,commercial-property-to-rent-in-heriotdale-1155...,115517306,Commercial,,,...,,225.0,,,,Heriotdale,Johannesburg,Gauteng,Per m²,
12308,POAPer Month,"Nelspruit, Nelsville",Industrial Property to Rent in Nelsville,"This rarely available 3,788sqm industrial ware...",Property OverviewListing Number113997397Type o...,industrial-property-to-rent-in-nelsville-11399...,113997397,Industrial,,No,...,,,,,,Nelsville,Nelspruit,Mpumalanga,Per Month,


In [None]:
# Drop rows where 'price (in rands)' is still null
results_df = results_df.dropna(subset=['price (in rands)'])

In [None]:
# Drop the price column as we do not need it anymore
results_df = results_df.drop(columns=['price'])

In [None]:
# Set display options
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_colwidth', None)  # Show full content in each cell
pd.set_option('display.width', None)         # Prevent line wrapping

#### Pets Allowed

In [42]:
results_df['Pets Allowed'].isnull().sum()

np.int64(10304)

In [44]:
results_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 13905 entries, 0 to 13928
Data columns (total 26 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   price                 13905 non-null  object 
 1   location              7488 non-null   object 
 2   property_title        13905 non-null  object 
 3   property_description  13905 non-null  object 
 4   property_features     13905 non-null  object 
 5   listing_slug          13905 non-null  object 
 6   Listing Number        13905 non-null  object 
 7   Type of Property      13905 non-null  object 
 8   Pets Allowed          3601 non-null   object 
 9   Furnished             9270 non-null   object 
 10  Bedrooms              5456 non-null   Int64  
 11  Bathrooms             7792 non-null   object 
 12  Kitchen               6518 non-null   object 
 13  Lounge                1606 non-null   object 
 14  Dining Room           802 non-null    object 
 15  Parking               54

In [40]:
results_df['Pets Allowed'].value_counts()

Pets Allowed
No     2598
Yes    1003
Name: count, dtype: int64

### Listings for sale

In [None]:
listings_for_sale = pd.read_csv("../data/raw/listings_for_sale_in_capital_cities.csv")
display("Listings for sale", listings_for_sale.head(), listings_for_sale.shape)

In [None]:
# Print a summary of the listings_for_sale dataset
listings_for_sale.info()

In [None]:
# See the number of null values in each column
listings_for_sale.isnull().sum()

In [None]:
# Drop listings where the property_title is null
listings_for_sale = listings_for_sale.dropna(subset=['property_title']).reset_index(drop=True)

In [None]:
listings_for_sale_duplicates = listings_for_sale[listings_for_sale.duplicated()]
num_listings_for_sale_duplicates = len(listings_for_sale_duplicates)
print(f"There are {num_listings_for_sale_duplicates} duplicates in the listings-for-sale dataset")

In [None]:
# Drop duplicate rows and keep the first occurrence
listings_for_sale = listings_for_sale.drop_duplicates().reset_index(drop=True)

In [None]:
print(listings_for_sale['price'].str.contains(
    'POA', case=False, na=False).sum())

In [None]:
# Disabled: remove rows where 'price' contains 'POA' (case-insensitive)
# listings_for_sale = listings_for_sale[~listings_for_sale['price'].astype(
#     str).str.contains('POA', case=False, na=False)]
# listings_for_sale.reset_index(drop=True, inplace=True)

In [None]:
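# Extract the rand amount (an 'R' followed by digits and spaces) from the price column
listings_for_sale['price_cleaned'] = listings_for_sale['price'].str.extract(
    r'(R\s?\d[\d\s]*)', expand=False)

# Remove the spaces inside the amount
listings_for_sale['price_cleaned'] = listings_for_sale['price_cleaned'].str.replace(
    r'\s+', '', regex=True)

# Keep the digits only and convert them to numbers
listings_for_sale['price (in rands)'] = pd.to_numeric(
    listings_for_sale['price_cleaned'].str.extract(r'R?(\d+)', expand=False),
    errors='coerce')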

# Print out the first five rows to see the results
listings_for_sale.head()

In [None]:
# Extract the suburb from the property_title column
listings_for_sale['suburb'] = listings_for_sale['property_title'].str.extract(
    r'for Sale in (.+)', expand=False)

# Remove the trailing numbers from the suburb column
listings_for_sale['suburb'] = (
    listings_for_sale['suburb'].str.replace(r'-\d+$', '', regex=True).str.strip())

# Print the first five rows of the table to see the results
listings_for_sale.head()

In [None]:
# Apply the extract_city_from_location function to extract the city from the location column
listings_for_sale['city'] = listings_for_sale['location'].apply(
    lambda loc: extract_city_from_location(loc, city_list))

In [None]:
listings_for_sale['city'].isnull().sum()

In [None]:
# Create a mapping of suburb to known cities
suburb_to_city_map = listings_for_sale[listings_for_sale['city'].notna()].drop_duplicates(
    subset=['suburb', 'city']).set_index('suburb')['city'].to_dict()

# Fill missing city values based on the suburb
listings_for_sale['city'] = listings_for_sale.apply(lambda row: suburb_to_city_map.get(
    row['suburb'], row['city']) if pd.isna(row['city']) else row['city'], axis=1)

In [None]:
# Count the number of listings per city
listings_for_sale['city'].value_counts()

In [None]:
listings_for_sale.isnull().sum()

In [None]:
# Drop the rows where the city column is empty
listings_for_sale = listings_for_sale.dropna(subset=['city'])

In [None]:
listings_for_sale.isnull().sum()

In [None]:
# Merge to get the matching province for the listings
listings_for_sale = listings_for_sale.merge(city_province, on='city', how='left')

In [None]:
listings_for_sale.isnull().sum()

In [None]:
poa_mask = listings_for_sale['price'].astype(
    str).str.contains("POA", case=False, na=False)
num_poa = poa_mask.sum()

print(f"There are {num_poa} listings with 'POA' in the price column.")