<a href="https://www.kaggle.com/code/mikedelong/karachi-real-estate-eda?scriptVersionId=144782149" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import numpy as np
import pandas as pd
# we can drop the Type column because it is always House
df = pd.read_csv(filepath_or_buffer='/kaggle/input/karachi-houses/homes.csv').drop(columns='Type')
df.head()

In [None]:
df.info()

In [None]:
from plotly.express import histogram
def get_area_amount(arg):
    if isinstance(arg, float): # nans
        return arg
    if arg == 'Area': # special case
        return np.nan
    return int(arg.split()[0].replace(',', ''))
df['area_amount'] = df['Area'].apply(func=get_area_amount)
histogram(data_frame=df, x='area_amount', log_y=True)

We expect price to be a function of area, neighborhood, and room details, so we need the area as an integer and we need to handle some noise in the data. The log scale on the y-axis spreads out the cluster near zero. This is an exploratory histogram and probably not appropriate for a general audience.
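The same cleanup can also be written with vectorized string methods. The sketch below is one possible alternative, not the notebook's method. The sample strings, including the `Sq. Yd.` unit, are made up for illustration, and the result is float rather than integer because the column contains missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real Area column may be formatted differently.
sample_area = pd.Series(['1,000 Sq. Yd.', 'Area', np.nan, '240 Sq. Yd.'])

# Take the first whitespace-separated token, drop thousands separators,
# and coerce anything non-numeric (such as a stray 'Area' header) to NaN.
first_token = sample_area.str.split().str[0]
area_amount = pd.to_numeric(first_token.str.replace(',', '', regex=False), errors='coerce')
```

Coercing with `errors='coerce'` folds every unparseable value into NaN, which may hide distinct kinds of noise that the explicit special case keeps visible.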

In [None]:
df['area_amount'].isna().sum() / len(df)

About 2.3% of our area values are missing or unparseable; that is probably acceptable.

In [None]:
def get_area_units(arg):
    if isinstance(arg, float):
        return arg
    pieces = arg.split()
    if len(pieces) == 1:
        return np.nan
    return ' '.join(pieces[1:])
df['area_units'] = df['Area'].apply(get_area_units)
df['area_units'].value_counts(dropna=False)

All of our areas are in the same units, so we don't need to do any conversions. And again, about 2% of our area data is unusable.

In [None]:
histogram(data_frame=df, x='Baths', log_y=True)

This is kind of a mess, so we need to derive a cleaner column we can use. We assume '-' means zero rather than unknown.

In [None]:
def get_baths(arg):
    if isinstance(arg, float):
        return arg
    if str(arg).isdigit():
        return int(arg)
    if arg == '-':
        return 0
    return np.nan
df['baths'] = df['Baths'].apply(get_baths)
histogram(data_frame=df, x='baths')
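Whether '-' means zero or unknown is an assumption, so it may help to make it a switch. The sketch below is a hypothetical variant (the name `parse_room_count` and the `dash_as_zero` parameter are not part of the notebook) that lets both readings be compared.

```python
import math

def parse_room_count(value, dash_as_zero=True):
    """Parse a raw room-count cell; hypothetical helper, not the notebook's function."""
    if isinstance(value, float):  # NaN from pandas stays missing
        return value
    text = str(value).strip()
    if text.isdigit():
        return int(text)
    if text == '-':
        # The notebook's assumption is zero; the alternative reading is "unknown".
        return 0 if dash_as_zero else math.nan
    return math.nan  # anything else is treated as noise
```

Something like `df['Baths'].apply(parse_room_count, dash_as_zero=False)` would then show how many more values go missing under the stricter reading.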

In [None]:
df['baths'].isna().sum()

In [None]:
df['bedrooms'] = df['Bedrooms'].apply(get_baths)
histogram(data_frame=df, x='bedrooms')

We can reuse our baths cleanup function as-is because the Bedrooms column needs the same handling.

In [None]:
df[['baths', 'bedrooms']].value_counts()

In [None]:
from plotly.express import scatter
scatter(data_frame=df[['baths', 'bedrooms']].value_counts().to_frame().reset_index(), 
       x='bedrooms', y='baths', size='count')

Now we can see our inventory by room counts in a single plot.

In [None]:
df['Location'].value_counts(dropna=False)

In [None]:
from plotly.express import bar
df['location'] = df['Location'].apply(func=lambda x: x if isinstance(x, float) else x.split(',')[0])
for log_y in [False, True]:
    bar(data_frame=df['location'].value_counts().to_frame().reset_index(), x='location', y='count', log_y=log_y).show()

Most of the time our location may tell us something; sometimes it will probably tell us nothing.

In [None]:
scatter(data_frame=df[['location', 'baths']].groupby(by=['location', 'baths']).size().reset_index().rename(columns={0: 'count'}), x='location', y='baths', color='count')

In most locations the baths inventory is highly diverse; it is concentrated only in rare instances, e.g. 6-bath houses in DHA Defence.

In [None]:
scatter(data_frame=df[['location', 'bedrooms']].groupby(by=['location', 'bedrooms']).size().reset_index().rename(columns={0: 'count'}), x='location', y='bedrooms', color='count')

In [None]:
df['Price'].value_counts(dropna=False)

Time to do our crore and lakh conversions to make these into numerical data.

In [None]:
df['price_currency'] = df['Price'].apply(func=lambda x: x.split(',')[0])
df['price_currency'] = df['price_currency'].apply(lambda x: x if x == 'PKR' else np.nan)
df['price_currency'].value_counts(dropna=False)

Our currency is always PKR if it isn't noise.

In [None]:
def get_price_amount(arg):
    # the currency and the amount are separated by a comma
    comma_pieces = arg.split(',')
    if comma_pieces[0] == 'Price': # stray header row
        return np.nan
    pieces = comma_pieces[1].split(' ')
    raw_amount = float(pieces[0])
    multiplier = 1
    if pieces[1] == 'Lakh':
        multiplier = 100_000 # 1 lakh = 10^5
    elif pieces[1] == 'Crore':
        multiplier = 10_000_000 # 1 crore = 10^7
    return multiplier * raw_amount

df['price_amount'] = df['Price'].apply(get_price_amount)
histogram(data_frame=df, x='price_amount', log_y=True)
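As a worked check on the unit arithmetic: one lakh is 100,000 and one crore is 10,000,000, so 1.5 crore is 15,000,000 rupees and 85 lakh is 8,500,000 rupees. The sketch below is a hypothetical table-driven variant (`parse_price` and `UNIT_MULTIPLIERS` are illustrative names). It assumes the `PKR,<amount> <unit>` layout implied by the parsing above, and unlike the notebook it returns NaN for an unrecognized unit instead of treating the amount as plain rupees.

```python
import math

# Standard South Asian numbering units expressed in rupees.
UNIT_MULTIPLIERS = {'Lakh': 100_000, 'Crore': 10_000_000}

def parse_price(text):
    """Convert strings like 'PKR,1.5 Crore' to rupees; hypothetical helper."""
    currency, _, rest = text.partition(',')
    if currency != 'PKR':  # e.g. a stray 'Price' header row
        return math.nan
    parts = rest.split()
    if len(parts) != 2 or parts[1] not in UNIT_MULTIPLIERS:
        return math.nan  # unrecognized layout or unit: flag as missing
    return float(parts[0]) * UNIT_MULTIPLIERS[parts[1]]
```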

In [None]:
from plotly.express import bar
for column in ['Purpose',]:
    bar(data_frame=df[column].value_counts().to_frame().reset_index(), x=column, y='count').show()

We have some noise in our Purpose column we will need to clean up.

In [None]:
df['for_sale'] = df['Purpose'].apply(func=lambda x: x == 'For Sale')
df['for_sale'].value_counts(dropna=False)

A little over 1% of our Purpose data is bad. Unfortunately, because our Purpose data is essentially the same in every case, it will not tell us anything that will help our analysis.

We have our clean dataset now; how much of our dataset is not useful?

In [None]:
len(df.dropna()) / len(df)

96% of our rows have no nulls, so we should be able to proceed with confidence that our parsing and cleanup haven't completely ruined our data.
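To see which columns account for the incomplete rows, the missing share can be broken out per column. This is a minimal sketch on a toy frame with made-up values; the column names mirror the notebook's, but the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values standing in for the cleaned columns.
toy = pd.DataFrame({
    'area_amount': [120.0, np.nan, 240.0, 500.0],
    'baths': [2.0, 3.0, np.nan, 4.0],
    'price_amount': [1.0e7, 2.5e7, 4.0e7, 9.0e7],
})

# Fraction of missing values in each column, largest first.
null_share = toy.isna().mean().sort_values(ascending=False)

# Fraction of rows with no missing values at all (the figure computed above).
complete_share = len(toy.dropna()) / len(toy)
```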

Now we can look at how different things affect price.

In [None]:
from plotly.express import imshow
imshow(img=df[['baths', 'bedrooms', 'price_amount']].corr())

It is not surprising that bedroom and bath counts are highly correlated; it is more surprising that price is less strongly correlated with either.

In [None]:
bar(data_frame=df[['location', 'price_amount']].groupby(by=['location']).mean().reset_index().sort_values(by='price_amount'),
    x='location', y='price_amount', title='Location mean price')

We had to do a fair amount of work to get here, but this may be the nut graf. Mean prices vary a lot by location.

In [None]:
bar(data_frame=df[['location', 'price_amount']].groupby(by=['location']).median().reset_index().sort_values(by='price_amount'),
    x='location', y='price_amount', title='Location median price')

In [None]:
mean_df = df[['location', 'price_amount']].groupby(by=['location']).mean().reset_index()
median_df = df[['location', 'price_amount']].groupby(by=['location']).median().reset_index()
skew_df = mean_df.merge(right=median_df, on='location', how='inner').rename(columns={'price_amount_x': 'mean price', 'price_amount_y': 'median price'})
scatter(data_frame=skew_df, x='mean price', y='median price', hover_name='location',
       log_x=True, log_y=True)

In [None]:
skew_df['ratio'] = skew_df['mean price'] / skew_df['median price']
bar(data_frame=skew_df, x='location', y='ratio')

In most locations the mean and median prices are pretty close, but we do see some skew, mostly to the right. E.g. in Hill Park the mean is roughly twice the median.
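A mean-to-median ratio above 1 usually signals a right-skewed distribution, where a few expensive listings pull the mean up while the median stays put. The toy example below uses made-up prices for one hypothetical location to show how a single outlier can produce a ratio of 2.

```python
from statistics import mean, median

# Made-up prices (in crore) for one hypothetical location with a single outlier.
prices = [1.0, 2.0, 3.0, 4.0, 20.0]

mean_price = mean(prices)      # pulled upward by the outlier
median_price = median(prices)  # unaffected by how large the outlier is
ratio = mean_price / median_price
```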

In [None]:
scatter(data_frame=df, x='location', y='price_amount', color='baths', height=800)

We may have some noise in our baths data if zero-bath houses can be the most expensive properties in a location.

In [None]:
scatter(data_frame=df, x='location', y='price_amount', color='bedrooms', height=800)

Ditto bedrooms. Maybe some of those properties in Clifton and BMCHS are just really expensive vacant lots mislabeled as houses. Who knows?
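One way to list the suspicious cases is to pull the most expensive listing in each location and flag those that report zero baths or zero bedrooms. This is a sketch on toy data with made-up values; the column names mirror the notebook's, and the location labels are only placeholders.

```python
import pandas as pd

# Toy listings with made-up values; column names mirror the notebook's.
toy = pd.DataFrame({
    'location': ['Clifton', 'Clifton', 'Clifton', 'Gulshan', 'Gulshan'],
    'price_amount': [9.0e8, 1.2e8, 8.0e7, 5.0e7, 3.0e7],
    'baths': [0, 5, 4, 3, 2],
    'bedrooms': [0, 6, 4, 3, 2],
})

# Drop rows that cannot be ranked, then pick the priciest listing per location.
ranked = toy.dropna(subset=['location', 'price_amount'])
top = ranked.loc[ranked.groupby('location')['price_amount'].idxmax()]

# Flag top listings that report no baths or no bedrooms.
suspect = top[(top['baths'] == 0) | (top['bedrooms'] == 0)]
```

Running the same steps on `df` should give a short list of rows to inspect by hand before deciding whether they are mislabeled.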