## Introduction to Data Cleaning for Real Estate Price Prediction
Accurate real estate price prediction relies heavily on the quality of the underlying data. In this project, the initial focus is on cleaning and preparing a dataset containing detailed information about flats in Gurgaon. The data cleaning process ensures that the dataset is consistent, reliable, and ready for further analysis or modeling.

### Overview of the Data Cleaning Steps
#### 1. Loading and Inspecting the Data
* The dataset is imported directly from a raw CSV file hosted online.

* Initial inspection includes viewing random samples and checking data types, which were found to be mostly object (string) types.

#### 2. Handling Unnecessary and Duplicate Data
* Columns that do not contribute to price prediction (such as link and property_id) are dropped.

* The dataset is checked for duplicate rows so that they can be removed to avoid redundancy; the check found none in this dataset.

#### 3. Managing Missing Values
* A column-wise check is performed to identify missing values.

* For columns like bedRoom, rows with missing values are removed if other critical fields are also missing.

* For categorical fields such as additionalRoom and facing, missing values are filled with a placeholder ('not available') and standardized to lowercase for consistency.

#### 4. Data Type Conversion and Feature Engineering
* Numerical columns stored as strings (e.g., price, pricePerSqft, bedRoom, bathroom) are converted to appropriate numeric types after cleaning and formatting.

* The balcony column, which sometimes contains ambiguous values like '3+', is cleaned by keeping only the leading count (so '3+' stays as its own value and the column remains a string) and by replacing 'No' with '0'.

* The floorNum column is standardized by converting textual representations (e.g., 'Ground', 'Basement') to numeric values.

#### 5. Cleaning and Standardizing Categorical Data
* The society column, which contains many unique values and ratings, is cleaned by removing rating symbols and converting all entries to lowercase.

* Categorical columns are prepared for future encoding, with consideration given to the high cardinality of some fields.

#### 6. Feature Creation
* New features such as area (calculated from price and price per square foot) and property_type are created to enrich the dataset and provide more predictive power.

#### 7. Exporting the Cleaned Data
* The final cleaned dataset is saved as a new CSV file, ready for use in modeling and analysis.

In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [3]:
base_url = "https://raw.githubusercontent.com/pranta-iitp/Real-Estate-Property-Price-Prediction-Project/main/flats.csv"
df = pd.read_csv(base_url)

In [4]:
df.sample(5)

Unnamed: 0,property_name,link,society,price,area,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating,property_id
682,2 BHK Flat in Sector 108 Gurgaon,https://www.99acres.com/2-bhk-bedroom-apartmen...,Experion The Heartsong3.9 ★,98 Lac,"₹ 13,343/sq.ft.",Super Built up area 1283(119.19 sq.m.)Built Up...,2 Bedrooms,3 Bathrooms,3 Balconies,Study Room,"Sector 108 Gurgaon, Gurgaon, Haryana",4th of 14 Floors,East,1 to 5 Year Old,"['Galleria 108 Mall', 'Dwarka Expressway', 'Ce...",Corner and airy flat with ample of sunlight th...,"['1 Bed', '1 Sofa', '8 Light', '3 AC', '1 Curt...","['Security / Fire Alarm', 'Feng Shui / Vaastu ...","['Green Area5 out of 5', 'Construction4 out of...",P69721330
634,3 BHK Flat in Sector 61 Gurgaon,https://www.99acres.com/3-bhk-bedroom-apartmen...,Pioneer Park3.8 ★,2.04 Crore,"₹ 14,623/sq.ft.",Super Built up area 1795(166.76 sq.m.)Built Up...,3 Bedrooms,3 Bathrooms,3+ Balconies,Servant Room,"Near To Sector 56 Rapid Metro Station Gurgaon,...",16th of 32 Floors,South,5 to 10 Year Old,"['Sector 55-56 Rapid Metro', 'Hong Kong Bazaar...",An excellent 3 bhk residential apartment is fo...,"['3 Wardrobe', '11 Fan', '1 Exhaust Fan', '4 G...","['Centrally Air Conditioned', 'Water purifier'...","['Green Area5 out of 5', 'Construction3 out of...",S69946912
2764,2 BHK Flat in Sector 78 Gurgaon,https://www.99acres.com/2-bhk-bedroom-apartmen...,Umang Monsoon Breeze,70 Lac,"₹ 5,719/sq.ft.",Built Up area: 1224 (113.71 sq.m.),2 Bedrooms,2 Bathrooms,2 Balconies,,"Sector 78,gurgaon, Sector 78 Gurgaon, Gurgaon,...",5th of 12 Floors,,undefined,"['Proposed Metro Station', 'Mahapal Shing', 'N...",2bhk multistorey apartment for resale in umang...,"['1 Light', 'No AC', 'No Bed', 'No Chimney', '...",,"['Safety4 out of 5', 'Lifestyle4 out of 5', 'E...",D67903782
2190,4 BHK Flat in Sector 82 Gurgaon,https://www.99acres.com/4-bhk-bedroom-apartmen...,Mapsko Casa Bella3.8 ★,1.75 Crore,"₹ 6,903/sq.ft.",Built Up area: 2535 (235.51 sq.m.)Carpet area:...,4 Bedrooms,5 Bathrooms,3+ Balconies,,"Sector 82, Sector 82 Gurgaon, Gurgaon, Haryana",1st of 1 Floors,West,undefined,"['Vatika City Centre Mall', 'Pataudi Road', 'B...","Low floor easy for access , well maintained lo...","['1 Light', 'No AC', 'No Bed', 'No Chimney', '...",,"['Green Area4.5 out of 5', 'Construction4 out ...",J69772634
2477,2 BHK Flat in Sector 63A Gurgaon,https://www.99acres.com/2-bhk-bedroom-apartmen...,Signature Global City 63A,1.65 Crore,"₹ 15,263/sq.ft.",Super Built up area 1081(100.43 sq.m.),2 Bedrooms,2 Bathrooms,2 Balconies,,"Sector 63A Gurgaon, Gurgaon, Haryana",1st of 4 Floors,,Oct 2024,"['Sector 54 Chowk Metro Station', 'Paras Trini...",It's a 2bhk in a popular area (Adjoining humon...,"['1 Wardrobe', '1 Fan', '1 Light', '1 AC', 'No...",,"['Environment4 out of 5', 'Lifestyle4 out of 5...",G69165782


#### All columns are stored with the object (string) data type, so the numeric ones need to be converted

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3017 entries, 0 to 3016
Data columns (total 20 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   property_name    3017 non-null   object
 1   link             3017 non-null   object
 2   society          3016 non-null   object
 3   price            3007 non-null   object
 4   area             3004 non-null   object
 5   areaWithType     3008 non-null   object
 6   bedRoom          3008 non-null   object
 7   bathroom         3008 non-null   object
 8   balcony          3008 non-null   object
 9   additionalRoom   1694 non-null   object
 10  address          3002 non-null   object
 11  floorNum         3006 non-null   object
 12  facing           2127 non-null   object
 13  agePossession    3007 non-null   object
 14  nearbyLocations  2913 non-null   object
 15  description      3008 non-null   object
 16  furnishDetails   2203 non-null   object
 17  features         2594 non-null   

In [6]:
df.duplicated().sum()

np.int64(0)

#### Column wise missing values

In [7]:
df.isnull().sum()

column,missing_count
property_name,0
link,0
society,1
price,10
area,13
areaWithType,9
bedRoom,9
bathroom,9
balcony,9
additionalRoom,1323


In [8]:
df.columns

Index(['property_name', 'link', 'society', 'price', 'area', 'areaWithType',
       'bedRoom', 'bathroom', 'balcony', 'additionalRoom', 'address',
       'floorNum', 'facing', 'agePossession', 'nearbyLocations', 'description',
       'furnishDetails', 'features', 'rating', 'property_id'],
      dtype='object')

#### Two columns (link, property_id) will not help with price prediction, so we drop them permanently.

In [9]:
df.drop(columns=['link','property_id'],inplace=True)

In [10]:
df.columns

Index(['property_name', 'society', 'price', 'area', 'areaWithType', 'bedRoom',
       'bathroom', 'balcony', 'additionalRoom', 'address', 'floorNum',
       'facing', 'agePossession', 'nearbyLocations', 'description',
       'furnishDetails', 'features', 'rating'],
      dtype='object')

In [11]:
df.shape

(3017, 18)

In [12]:
df.rename(columns={'area':'pricePerSqft'},inplace=True)

In [13]:
df.columns

Index(['property_name', 'society', 'price', 'pricePerSqft', 'areaWithType',
       'bedRoom', 'bathroom', 'balcony', 'additionalRoom', 'address',
       'floorNum', 'facing', 'agePossession', 'nearbyLocations', 'description',
       'furnishDetails', 'features', 'rating'],
      dtype='object')

In [14]:
df['society'].value_counts()

society,count
SS The Leaf3.8 ★,73
Tulip Violet4.3 ★,40
Shapoorji Pallonji Joyville Gurugram4.0 ★,39
Signature Global Park4.0 ★,36
Shree Vardhman Victoria3.8 ★,35
Tulip Violet4.2 ★,33
Smart World Gems,32
Emaar MGF Emerald Floors Premier3.8 ★,32
Smart World Orchard,32
DLF The Ultima4.0 ★,31


In [15]:
df['society'].nunique()

638

#### The *society* column is categorical and has 638 unique values, so applying one-hot encoding to it directly would drastically increase the number of columns.

In [16]:
# Convert each value of society into lower case and remove the rating part
import re
df['society'] = df['society'].apply(lambda name: re.sub(r'\d+(\.\d+)?\s?★', '', str(name)).strip()).str.lower()
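The effect of this cleaning step can be checked on a few values. The sketch below reuses the same regular expression on society names taken from the `value_counts()` output; the variable names and the added missing value are illustrative assumptions. It also shows one side effect worth knowing: wrapping each value in `str()` turns a missing society into the literal string `'nan'`, whereas the vectorized `.str` methods leave it missing.

```python
import re

import numpy as np
import pandas as pd

# Sample values from the value_counts() output, plus one missing entry (assumed for illustration)
society = pd.Series(["Tulip Violet4.3 ★", "Tulip Violet4.2 ★", "Smart World Gems", np.nan])

rating_suffix = re.compile(r"\d+(\.\d+)?\s?★")

# Same approach as the notebook cell: str() is applied to every value, including NaN
cleaned = society.apply(lambda name: rating_suffix.sub("", str(name)).strip()).str.lower()
print(cleaned.tolist())  # ['tulip violet', 'tulip violet', 'smart world gems', 'nan']

# Alternative using only .str methods, which keeps the missing value as NaN
preserved = (
    society.str.replace(r"\d+(\.\d+)?\s?★", "", regex=True).str.strip().str.lower()
)
print(preserved.isna().sum())  # 1
```

The two differently rated "Tulip Violet" entries collapse into one name, which is why the number of unique societies drops after cleaning.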

In [17]:
df['society'].nunique()

604
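Even after cleaning, 604 distinct societies is still a lot for one-hot encoding. One common option, which this notebook does not apply, is to merge infrequent categories into a single bucket before encoding. The following is a minimal sketch under that assumption; the helper `group_rare`, the threshold, and the `'other'` label are hypothetical choices, not part of the original workflow.

```python
import pandas as pd


def group_rare(values: pd.Series, min_count: int, other_label: str = "other") -> pd.Series:
    """Replace categories that occur fewer than min_count times with other_label."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    # Keep frequent categories as they are and overwrite everything else
    return values.where(values.isin(frequent), other_label)


# Toy data using society names that appear in the dataset
society = pd.Series(
    ["ss the leaf"] * 3 + ["tulip violet"] * 2 + ["my home", "greenopolis"]
)
grouped = group_rare(society, min_count=2)
print(grouped.nunique())  # 3
```

A suitable threshold depends on the model and on how many listings each society has, so it would need to be tuned on the real data.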

In [18]:
#price
df['price'].nunique()

439

In [19]:
# Check if all values in the specified column are numerical
is_numerical = pd.to_numeric(df['price'], errors='coerce').notna().all()

In [20]:
is_numerical

np.False_

In [21]:
df['price'].unique()

array(['45 Lac', '50 Lac', '40 Lac', '1.47 Crore', '70 Lac', '41 Lac',
       '2 Crore', '1.8 Crore', '1.1 Crore', '4.75 Crore', '96 Lac',
       '29 Lac', '1.35 Crore', '95 Lac', '3.95 Crore', '90 Lac',
       '1.05 Crore', nan, '2.2 Crore', '1.01 Crore', '1.85 Crore',
       '86 Lac', '2.85 Crore', 'Price on Request', '42 Lac', '6.15 Crore',
       '6.25 Crore', '1.6 Crore', '3.25 Crore', '85 Lac', '75 Lac',
       '82 Lac', '29.99 Lac', '78 Lac', '74 Lac', '3.2 Crore',
       '1.3 Crore', '25 Lac', '1.99 Crore', '1.83 Crore', '2.25 Crore',
       '2.8 Crore', '83 Lac', '80 Lac', '1.25 Crore', '23 Lac', '30 Lac',
       '1.55 Crore', '79 Lac', '99 Lac', '1.9 Crore', '1 Crore',
       '2.5 Crore', '55 Lac', '65 Lac', '31.75 Lac', '93 Lac',
       '1.2 Crore', '56 Lac', '2.7 Crore', '1.45 Crore', '46 Lac',
       '4.5 Crore', '64 Lac', '28 Lac', '3.87 Crore', '1.38 Crore',
       '43 Lac', '28.5 Lac', '1.75 Crore', '2.1 Crore', '1.29 Crore',
       '1.65 Crore', '24 Lac', '31.5 Lac', '

In [22]:
df[df['price'] == 'Price on Request'].shape

(11, 18)

In [23]:
df = df[df['price'] != 'Price on Request']

In [24]:
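# convert all prices to crore
def treat_price(x):
    # Missing prices arrive as float NaN; pass them through unchanged
    if isinstance(x, float):
        return x
    # Otherwise x is an [amount, unit] list such as ['45', 'Lac'] or ['1.47', 'Crore']
    if x[1] == 'Lac':
        return round(float(x[0]) / 100, 2)  # 100 Lac = 1 Crore
    return round(float(x[0]), 2)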

In [25]:
df['price'] = df['price'].str.split(' ').apply(treat_price).astype('float')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price'] = df['price'].str.split(' ').apply(treat_price).astype('float')
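This warning is raised because `df` was produced by filtering another DataFrame, so pandas cannot tell whether the assignment is meant to affect the original. A possible way to avoid it is to take an explicit copy right after filtering. The sketch below shows that pattern on a small made-up frame, together with a vectorized variant of the Lac/Crore conversion; the names `raw`, `flats`, and `unit_in_crore` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with price strings in the same format as the dataset
raw = pd.DataFrame(
    {"price": ["45 Lac", "1.47 Crore", "Price on Request", np.nan, "29.99 Lac"]}
)

# .copy() makes the filtered frame independent, so later column assignments do not warn
flats = raw[raw["price"] != "Price on Request"].copy()

# Split "amount unit" into two columns; missing prices stay missing in both
parts = flats["price"].str.split(" ", expand=True)
amount = pd.to_numeric(parts[0], errors="coerce")
unit_in_crore = parts[1].map({"Lac": 0.01, "Crore": 1.0})  # 100 Lac = 1 Crore

flats["price"] = (amount * unit_in_crore).round(2)
print(flats["price"].tolist())
```

Mapping the unit to a multiplier also turns any unexpected unit into NaN instead of silently treating it as Crore, which can make odd values easier to spot.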


In [26]:
df['price'].unique()

array([ 0.45,  0.5 ,  0.4 ,  1.47,  0.7 ,  0.41,  2.  ,  1.8 ,  1.1 ,
        4.75,  0.96,  0.29,  1.35,  0.95,  3.95,  0.9 ,  1.05,   nan,
        2.2 ,  1.01,  1.85,  0.86,  2.85,  0.42,  6.15,  6.25,  1.6 ,
        3.25,  0.85,  0.75,  0.82,  0.3 ,  0.78,  0.74,  3.2 ,  1.3 ,
        0.25,  1.99,  1.83,  2.25,  2.8 ,  0.83,  0.8 ,  1.25,  0.23,
        1.55,  0.79,  0.99,  1.9 ,  1.  ,  2.5 ,  0.55,  0.65,  0.32,
        0.93,  1.2 ,  0.56,  2.7 ,  1.45,  0.46,  4.5 ,  0.64,  0.28,
        3.87,  1.38,  0.43,  1.75,  2.1 ,  1.29,  1.65,  0.24,  1.7 ,
        2.75,  0.34,  1.5 ,  0.26,  2.91,  1.15,  1.95,  0.88,  3.35,
        1.08,  2.15,  1.79,  0.22,  0.36,  2.9 ,  0.6 ,  2.4 ,  0.38,
        0.39,  0.27,  2.65,  1.32,  0.59,  1.4 ,  3.  ,  3.67,  3.85,
        3.29,  2.33,  2.79,  3.4 ,  0.72,  0.71,  1.41,  2.05,  0.54,
        1.42,  2.29,  0.97,  8.25,  4.25,  4.1 ,  0.92,  0.91,  0.66,
        0.53,  1.71,  5.7 ,  4.  ,  1.93,  3.45,  1.48,  1.49,  3.49,
        2.3 ,  2.12,

In [27]:
df.sample(5)

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
164,2 BHK Flat in Sector 95A Gurgaon,signature signum 95a,0.4,"₹ 7,782/sq.ft.",Carpet area: 514 (47.75 sq.m.),2 Bedrooms,2 Bathrooms,2 Balconies,,"Sector 95A Gurgaon, Gurgaon, Haryana",3rd of 19 Floors,East,0 to 1 Year Old,,Prime location\nAll necessary shops are in pre...,"['3 Fan', '1 Geyser', '5 Light', '1 Chimney', ...","['Security / Fire Alarm', 'Intercom Facility',...","['Environment5 out of 5', 'Safety5 out of 5', ..."
339,2 BHK Flat in Sector 79 Gurgaon,godrej,1.29,"₹ 8,206/sq.ft.",Super Built up area 1572(146.04 sq.m.),2 Bedrooms,2 Bathrooms,3 Balconies,,"Sector 79 Gurgaon, Gurgaon, Haryana",4th of 15 Floors,,0 to 1 Year Old,"['Vatika Town Square-INXT', 'Naurangpur Road',...",This is your chance to own a 2 bhk residential...,,,"['Green Area5 out of 5', 'Construction5 out of..."
2676,2 BHK Flat in Sector 65 Gurgaon,m3m heights,1.9,"₹ 13,991/sq.ft.",Built Up area: 1358 (126.16 sq.m.),2 Bedrooms,2 Bathrooms,2 Balconies,,"Sector 65, Sector 65 Gurgaon, Gurgaon, Haryana",40th of 40 Floors,,undefined,"['Rapid Metro Sector 56', 'M3m 65th Avenue Mal...",Best in class property available at sector 65 ...,"['1 Light', 'No AC', 'No Bed', 'No Chimney', '...",,"['Environment4 out of 5', 'Safety4 out of 5', ..."
783,3 BHK Flat in Sector 65 Gurgaon,emaar emerald hills,1.95,"₹ 13,928/sq.ft.",Carpet area: 1400 (130.06 sq.m.),3 Bedrooms,3 Bathrooms,3+ Balconies,Store Room,"Amber Block, Sector 65 Gurgaon, Gurgaon, Haryana",1st of 2 Floors,North-East,0 to 1 Year Old,"['Emerald Plaza Shopping Mall', 'Southern Peri...","Situated in sector 65 gurgaon, emaar emerald h...","['1 Water Purifier', '5 Fan', '1 Exhaust Fan',...","['Security / Fire Alarm', 'Power Back-up', 'Fe...","['Green Area5 out of 5', 'Construction5 out of..."
1422,2 BHK Flat in Sector 37C Gurgaon,apex our homes,0.33,"₹ 3,567/sq.ft.",Super Built up area 925(85.94 sq.m.)Built Up a...,2 Bedrooms,2 Bathrooms,1 Balcony,,"358, Sector 37C Gurgaon, Gurgaon, Haryana",3rd of 11 Floors,West,1 to 5 Year Old,"['Elvedor Mall', 'MDS Public School', 'K.D. Ho...",This 2 bhk apartment is available for sale in ...,"['2 Wardrobe', '4 Fan', '1 Exhaust Fan', '2 Ge...","['Centrally Air Conditioned', 'Water purifier'...","['Environment5 out of 5', 'Lifestyle4 out of 5..."


In [28]:
# pricePerSqft
df['pricePerSqft'].nunique()

2130

In [29]:
df['pricePerSqft'] = df['pricePerSqft'].str.split('/').str.get(0).str.replace('₹','').str.replace(',','').str.strip().astype('float')
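As a quick sanity check of this parsing, the sketch below applies an equivalent extraction to two strings copied from the sample output. It uses a regular expression instead of the chained replacements, which is an alternative formulation rather than the notebook's method; the variable names are illustrative.

```python
import pandas as pd

# Strings in the format shown in the sample rows, plus a missing value
raw = pd.Series(["₹ 13,343/sq.ft.", "₹ 5,719/sq.ft.", None])

# Grab the first run of digits and thousands separators, then drop the commas
digits = raw.str.extract(r"([\d,]+(?:\.\d+)?)", expand=False).str.replace(",", "")
per_sqft = pd.to_numeric(digits, errors="coerce")
print(per_sqft.tolist()[:2])  # [13343.0, 5719.0]
```

With `errors="coerce"`, any value that does not follow the expected format becomes NaN rather than raising an error during the cast.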

In [30]:
df.sample(5)

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
1642,3 BHK Flat in Sector 72 Gurgaon,tata primanti,3.4,15560.0,Super Built up area 2185(202.99 sq.m.),3 Bedrooms,4 Bathrooms,3+ Balconies,Servant Room,"Sector 72 Gurgaon, Gurgaon, Haryana",9th of 40 Floors,,5 to 10 Year Old,"['Sector 55-56 Metro Station', 'Omaxe City Cen...",4 bedroom 2185 sq.Ft. Middle floor apartment a...,[],"['Power Back-up', 'Intercom Facility', 'Lift(s...","['Green Area4 out of 5', 'Construction4 out of..."
1869,2 BHK Flat in Sector 66 Gurgaon,emaar mgf palm studios,1.45,14795.0,Super Built up area 1188(110.37 sq.m.)Carpet a...,2 Bedrooms,2 Bathrooms,3 Balconies,,"402, Sector 66 Gurgaon, Gurgaon, Haryana",3rd of 20 Floors,East,1 to 5 Year Old,"['Sri Radhe Krishna Temple', 'Icici bank ATM',...","Emaar mgf the palm drive in sector 66, gurgaon...","['2 Wardrobe', '4 Fan', '3 Geyser', '5 Light',...","['Feng Shui / Vaastu Compliant', 'Intercom Fac...","['Environment3 out of 5', 'Lifestyle4 out of 5..."
123,3 BHK Flat in Sector 89 Gurgaon,m3m soulitude,1.25,8784.0,Super Built up area 1423(132.2 sq.m.),3 Bedrooms,3 Bathrooms,3 Balconies,"Study Room,Others","S-68/3, Sector 89 Gurgaon, Gurgaon, Haryana",3rd of 4 Floors,East,Within 6 months,"['Vatika Town Square-INXT', 'Sector 86 Road', ...",This lovely 3 bhk apartment/flat in sector 89 ...,,,"['Environment4 out of 5', 'Safety4 out of 5', ..."
1704,3 BHK Flat in Sector 102 Gurgaon,shapoorji pallonji joyville gurugram,1.75,9449.0,Super Built up area 1852(172.06 sq.m.)Carpet a...,3 Bedrooms,3 Bathrooms,3 Balconies,"Pooja Room,Study Room,Servant Room,Others","Sector 102 Gurgaon, Gurgaon, Haryana",10th of 26 Floors,West,1 to 5 Year Old,"['Khan Market', 'The Esplanade Mall', 'Dwarka ...",Looking for a 3 bhk property for sale in gurga...,"['14 Light', '1 Modular Kitchen', '1 Chimney',...","['Feng Shui / Vaastu Compliant', 'Security / F...","['Green Area4 out of 5', 'Construction5 out of..."
497,3 BHK Flat in Sector 1 Imt Manesar,sbr minda sec-1 imt manesar,0.64,4193.0,Carpet area: 1526 (141.77 sq.m.),3 Bedrooms,2 Bathrooms,3+ Balconies,,"Sector 1 Imt Manesar, Gurgaon, Haryana",6th of 10 Floors,East,10+ Year Old,"['Orris Community Center', 'Indian Oil', 'Essa...",Flat for sale sec-1 imt manesar good property ...,,"['Power Back-up', 'Intercom Facility', 'Lift(s...","['Environment5 out of 5', 'Lifestyle4.5 out of..."


In [31]:
# bedrooms
df[df['bedRoom'].isnull() == True]

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
2849,2 BHK Flat in Sector 107 Gurgaon,signature global solera,,,,,,,,,,,,,,,,
2850,4 BHK Flat in Sector 53 Gurgaon,tulip monsella,,33198.0,,,,,,,,,,,,,,
2851,2 BHK Flat in New Palam Vihar,my home,,4400.0,,,,,,,,,,,,,,
2852,2 BHK Flat in Sohna,breez global hill view,,5470.0,,,,,,,,,,,,,,
2922,3 BHK Flat in Sector 99A Gurgaon,pareena coban residences,,5759.0,,,,,,,,,,,,,,
2923,1 BHK Flat in Golf Course Ext Road,ikon tower baani city centre,,12437.0,,,,,,,,,,,,,,
2924,2 BHK Flat in Sector 89 Gurgaon,greenopolis,,4820.0,,,,,,,,,,,,,,
2926,4 BHK Flat in C Block Sushant Lok Phase 1,maple heights,,8888.0,,,,,,,,,,,,,,
2927,2 BHK Flat in Sector 99 Gurgaon,assotech blith,,6593.0,,,,,,,,,,,,,,


#### The *bedRoom* column has 9 missing values, and nearly all other columns are missing in those same rows, so these rows are removed.

In [32]:
df = df[df['bedRoom'].notnull()]

In [33]:
df.shape

(2997, 18)

In [34]:
df['bedRoom'] = df['bedRoom'].str.split(' ').str.get(0).astype('int')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['bedRoom'] = df['bedRoom'].str.split(' ').str.get(0).astype('int')


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2997 entries, 0 to 3016
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   property_name    2997 non-null   object 
 1   society          2997 non-null   object 
 2   price            2996 non-null   float64
 3   pricePerSqft     2996 non-null   float64
 4   areaWithType     2997 non-null   object 
 5   bedRoom          2997 non-null   int64  
 6   bathroom         2997 non-null   object 
 7   balcony          2997 non-null   object 
 8   additionalRoom   1692 non-null   object 
 9   address          2991 non-null   object 
 10  floorNum         2995 non-null   object 
 11  facing           2123 non-null   object 
 12  agePossession    2996 non-null   object 
 13  nearbyLocations  2906 non-null   object 
 14  description      2997 non-null   object 
 15  furnishDetails   2200 non-null   object 
 16  features         2590 non-null   object 
 17  rating           26

In [36]:
df.sample(5)

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
2925,1 BHK Flat in Sector 95 Gurgaon,rof ananda,0.26,6032.0,Super Built up area 431(40.04 sq.m.)Carpet are...,1,1 Bathroom,1 Balcony,,"F 801, Sector 95 Gurgaon, Gurgaon, Haryana",8th of 14 Floors,North,0 to 1 Year Old,"['Metro', 'Dwarka Expressway', 'Rajeev Chowk',...",All amenities in societies can be seen from go...,"['3 Fan', '1 Exhaust Fan', '5 Light', 'No AC',...","['Feng Shui / Vaastu Compliant', 'Security / F...","['Safety4.5 out of 5', 'Lifestyle4.5 out of 5'..."
2600,3 BHK Flat in Sector 109 Gurgaon,ats tourmaline,2.3,8897.0,Super Built up area 2585(240.15 sq.m.),3,4 Bathrooms,3+ Balconies,"Servant Room,Others","Flat Number 2101, 10th Floor, Tower 2, Sector ...",10th of 27 Floors,East,0 to 1 Year Old,"['Dwarka Sector 21 Metro Station', 'NeoSquare ...",Residential apartment for sell.Located on 10th...,[],"['Feng Shui / Vaastu Compliant', 'Security / F...","['Green Area4.5 out of 5', 'Amenities4.5 out o..."
395,3 BHK Flat in Sector 102 Gurgaon,emaar gurgaon greens,1.35,8181.0,Super Built up area 1650(153.29 sq.m.)Carpet a...,3,3 Bathrooms,3 Balconies,Servant Room,"Sector 102 Gurgaon, Gurgaon, Haryana",9th of 14 Floors,North-East,1 to 5 Year Old,"['JMS Marine Square Mall', 'Dwarka Expressway'...","3 bhi plus servant,spacious flat is ready to m...","['3 Fan', '4 Light', '4 AC', '1 Modular Kitche...","['Intercom Facility', 'Lift(s)', 'Maintenance ...","['Green Area5 out of 5', 'Construction4 out of..."
477,3 BHK Flat in Sector 66 Gurgaon,emaar mgf the palm drive,2.99,14950.0,Super Built up area 2000(185.81 sq.m.)Built Up...,3,3 Bathrooms,3 Balconies,Servant Room,"Sector 66 Gurgaon, Gurgaon, Haryana",7th of 17 Floors,East,1 to 5 Year Old,"['Sector 55-56 Rapid Metro Station', 'HUB 66',...",We deal exclusively in palm drive,"['5 Fan', '1 Exhaust Fan', '3 Geyser', '1 Stov...","['Feng Shui / Vaastu Compliant', 'Security / F...","['Green Area5 out of 5', 'Construction4.5 out ..."
2983,3 BHK Flat in Sector 95 Gurgaon,sidhartha ncr one phase,0.9,4500.0,Built Up area: 2000 (185.81 sq.m.),3,3 Bathrooms,1 Balcony,,"Sector 95 Gurgaon, Gurgaon, Haryana",15th of 16 Floors,,1 to 5 Year Old,"['Yadav Clinic', 'Bangali Clinic', 'Dr. J. S. ...",Residential apartment for sell.Located on 15th...,"['1 Wardrobe', '1 Fan', '1 Light', 'No AC', 'N...","['Power Back-up', 'Lift(s)', 'Swimming Pool', ...","['Management4 out of 5', 'Green Area4 out of 5..."


In [37]:
#bathroom
df[df['bathroom'].isnull() == True]

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating


#### The *bathroom* column contains no missing values.

In [38]:
df['bathroom'] = df['bathroom'].str.split(' ').str.get(0).astype('int')

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2997 entries, 0 to 3016
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   property_name    2997 non-null   object 
 1   society          2997 non-null   object 
 2   price            2996 non-null   float64
 3   pricePerSqft     2996 non-null   float64
 4   areaWithType     2997 non-null   object 
 5   bedRoom          2997 non-null   int64  
 6   bathroom         2997 non-null   int64  
 7   balcony          2997 non-null   object 
 8   additionalRoom   1692 non-null   object 
 9   address          2991 non-null   object 
 10  floorNum         2995 non-null   object 
 11  facing           2123 non-null   object 
 12  agePossession    2996 non-null   object 
 13  nearbyLocations  2906 non-null   object 
 14  description      2997 non-null   object 
 15  furnishDetails   2200 non-null   object 
 16  features         2590 non-null   object 
 17  rating           26

In [40]:
#balcony
df[df['balcony'].isnull() == True]

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating


In [41]:
df.sample(3)

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
3000,3 BHK Flat in Sector 78 Gurgaon,umang monsoon breeze,1.2,6493.0,Super Built up area 2250(209.03 sq.m.)Carpet a...,3,3,2 Balconies,Pooja Room,"Sector 78 Gurgaon, Gurgaon, Haryana",8th of 12 Floors,,5 to 10 Year Old,"['Proposed Metro Station', 'Mahapal Shing', 'N...","A 3 bedroom flat, located in sector-78 gurgaon...","['1 Water Purifier', '1 Fridge', '5 Fan', '1 E...","['Piped-gas', 'Natural Light']","['Safety4 out of 5', 'Lifestyle4 out of 5', 'E..."
2714,4 BHK Flat in Valley View Estate,valley view estate,1.5,7489.0,Super Built up area 2000(185.81 sq.m.)Built Up...,4,4,3+ Balconies,,"Tower 14, Valley View Estate, Gurgaon, Haryana",10th of 18 Floors,North-East,5 to 10 Year Old,"['HANUMAN MANDIR', 'SHIV MANDIR BALIYAWAS', 'I...",Awesome flat with all fittings 3 side open cor...,[],"['Feng Shui / Vaastu Compliant', 'Lift(s)', 'M...","['Environment5 out of 5', 'Safety5 out of 5', ..."
1340,3 BHK Flat in Sector 92 Gurgaon,sare homes,0.68,5238.0,Carpet area: 1298 (120.59 sq.m.),3,3,2 Balconies,Pooja Room,"1102, Sector 92 Gurgaon, Gurgaon, Haryana",11st of 21 Floors,North-West,1 to 5 Year Old,"['Yadav Clinic', 'Bangali Clinic', 'Dr. J. S. ...",This beautiful 3 bhk flat in sector 92 gurgaon...,,"['Security / Fire Alarm', 'Feng Shui / Vaastu ...","['Environment5 out of 5', 'Lifestyle4 out of 5..."


#### The *balcony* column contains values like '3+', so the exact number of balconies is not always known. For that reason we do not convert it to an integer type; we only keep the leading count and replace 'No' with '0'.

In [42]:
df['balcony'] = df['balcony'].str.split(' ').str.get(0).str.replace('No','0')
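Because `'3+'` cannot be cast to an integer, one option for later modeling is to treat the cleaned values as ordered categories. The sketch below is an assumption about how that could look and is not part of the notebook; the raw string `'No Balcony'` is a guess at the original format, and the category list is chosen by hand.

```python
import pandas as pd

# Raw strings in the dataset's format; 'No Balcony' is an assumed spelling
balcony = pd.Series(["3 Balconies", "3+ Balconies", "No Balcony", "1 Balcony", "2 Balconies"])

# Same cleaning as the notebook cell: keep the first token and map 'No' to '0'
levels = balcony.str.split(" ").str.get(0).str.replace("No", "0")

# Declare the order explicitly so that '3+' ranks above '3'
ordered = pd.Categorical(levels, categories=["0", "1", "2", "3", "3+"], ordered=True)
codes = pd.Series(ordered.codes, index=balcony.index)
print(codes.tolist())  # [3, 4, 0, 1, 2]
```

The integer codes preserve the ranking without pretending that `'3+'` is exactly four balconies.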

In [43]:
# additionalRoom
df[df['additionalRoom'].isnull() == True].shape

(1305, 18)

In [44]:
df['additionalRoom'].nunique()

49

In [45]:
df['additionalRoom'].fillna('not available',inplace=True)
df['additionalRoom'] = df['additionalRoom'].str.lower()

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['additionalRoom'].fillna('not available',inplace=True)
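The warning itself suggests assigning the result back to the column instead of calling `fillna` with `inplace=True` on a selected column. A minimal sketch of that pattern, on a made-up frame with the same two categorical columns the notebook fills, might look like this:

```python
import numpy as np
import pandas as pd

# Toy frame with values in the dataset's style
flats = pd.DataFrame(
    {
        "additionalRoom": ["Study Room", np.nan, "Servant Room"],
        "facing": ["East", np.nan, "North-East"],
    }
)

# Assign the filled and lowercased result back to the column (no inplace=True)
for column in ["additionalRoom", "facing"]:
    flats[column] = flats[column].fillna("not available").str.lower()

print(flats["facing"].tolist())  # ['east', 'not available', 'north-east']
```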


In [46]:
df[df['additionalRoom'].isnull() == True].shape

(0, 18)

In [47]:
df['additionalRoom'].nunique()

50

#### For the *additionalRoom* column, we replaced the *NaN* values with *not available*. We also converted all values to lowercase so that entries differing only in letter case are treated as the same category.

In [48]:
# floor num
df['floorNum'].nunique()

550

In [49]:
df['floorNum'].isnull().sum()

np.int64(2)

In [50]:
df[df['floorNum'].isnull() == True]

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
181,3 BHK Flat in Dwarka Expressway Gurgaon,experion heartsong,1.08,6150.0,Built Up area: 1758 (163.32 sq.m.),3,3,0,not available,"604, Tower B-3, 6th Floor,Sector 108, Dwarka E...",,,Under Construction,,A property by one of the most reputed builders...,[],,
2766,2 BHK Flat in Sector 78 Gurgaon,,0.6,3692.0,Built Up area: 1625 (150.97 sq.m.),2,2,0,not available,"Gurgaon, Sector 78 Gurgaon, Gurgaon, Haryana",,,Under Construction,,The property is under construction it's by rah...,[],,"['Safety4 out of 5', 'Lifestyle4 out of 5', 'E..."


In [51]:
# Allow an optional leading minus so that 'Basement' (-1) is not reduced to 1
df['floorNum'] = df['floorNum'].str.split(' ').str.get(0).replace('Ground','0').str.replace('Basement','-1').str.replace('Lower','0').str.extract(r'(-?\d+)', expand=False)
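When parsing floor labels, a pattern that matches digits only would drop the minus sign from a basement value of `'-1'`, so it helps to test the special labels explicitly. The sketch below does this on a few strings; only `'4th of 14 Floors'` is taken from the data shown here, and the other raw strings are assumed formats used for illustration.

```python
import pandas as pd

# First value is from the sample output; the rest are assumed formats
floor = pd.Series(
    [
        "4th of 14 Floors",
        "Ground of 3 Floors",
        "Basement of 2 Floors",
        "Lower Ground of 5 Floors",
        None,
    ]
)

first_token = floor.str.split(" ").str.get(0)
# Exact-match replacement of the textual labels with numeric strings
mapped = first_token.replace({"Ground": "0", "Basement": "-1", "Lower": "0"})

# '-?' keeps the sign; to_numeric gives a real numeric column with NaN for missing floors
floor_num = pd.to_numeric(mapped.str.extract(r"(-?\d+)", expand=False), errors="coerce")
print(floor_num.tolist()[:4])  # [4.0, 0.0, -1.0, 0.0]
```

Converting with `pd.to_numeric` at the end is optional, but it makes the column usable directly in numeric comparisons.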

In [52]:
df.sample(5)

Unnamed: 0,property_name,society,price,pricePerSqft,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
1546,4 BHK Flat in Sector 81 Gurgaon,bestech park view grand spa,4.7,6786.0,Super Built up area 6926(643.45 sq.m.),4,4,3+,servant room,"1908, Sector 81 Gurgaon, Gurgaon, Haryana",19,North,1 to 5 Year Old,"['Sapphire 83 Mall', 'NH-8, IMT Manesar', 'Dwa...",Semi furnished 4bhk 6926sqft penthouse with pe...,"['4 Wardrobe', '1 Exhaust Fan', '4 Geyser', '4...","['Security / Fire Alarm', 'Power Back-up', 'Pr...","['Green Area5 out of 5', 'Construction5 out of..."
1346,2 BHK Flat in Palam Vihar,bestech park view residency,1.01,7137.0,Super Built up area 1415(131.46 sq.m.),2,2,3,not available,"Palam Vihar, Gurgaon, Haryana",0,East,5 to 10 Year Old,"['Dwarka Sector 21 Metro Station', 'HUDA Marke...",Bestech park view residency is one of gurgaon'...,"['2 Wardrobe', '2 Fan', '1 Geyser', '4 Light',...","['Security / Fire Alarm', 'Power Back-up', 'Li...","['Green Area5 out of 5', 'Construction4 out of..."
1205,3 BHK Flat in Sector 1 Imt Manesar,hsiidc sidco aravali,0.81,3129.0,Super Built up area 2588(240.43 sq.m.)Built Up...,3,3,3+,servant room,"Sector 1 Imt Manesar, Gurgaon, Haryana",2,South-West,5 to 10 Year Old,"['Pooja Clinic', 'Dr. Sahil Clinic', 'Prakash ...",Located in the popular residential address of ...,,"['Security / Fire Alarm', 'Power Back-up', 'Fe...","['Environment5 out of 5', 'Lifestyle4.5 out of..."
2347,2 BHK Flat in Sector-5 Sohna,mvn athens sohna gurgaon,0.28,5833.0,Carpet area: 480 (44.59 sq.m.),2,2,1,not available,"Sector-5 Sohna, Gurgaon, Haryana",2,,1 to 5 Year Old,,"Near to dam dama lake, lush green and 3km away...",,"['Feng Shui / Vaastu Compliant', 'Security / F...",
1347,3 BHK Flat in Palam Vihar,bestech park view residency,1.51,7890.0,Super Built up area 1920(178.37 sq.m.),3,4,3+,servant room,"Palam Vihar, Gurgaon, Haryana",8,West,5 to 10 Year Old,"['Dwarka Sector 21 Metro Station', 'HUDA Marke...",Bestech park view residency is one of the most...,"['3 Wardrobe', '5 Fan', '5 Light', '1 Modular ...","['Security / Fire Alarm', 'Power Back-up', 'Li...","['Green Area5 out of 5', 'Construction4 out of..."


In [53]:
df['facing'].value_counts()

facing,count
North-East,505
East,490
North,301
South,203
West,183
North-West,162
South-East,144
South-West,135


In [54]:
df['facing'].isnull().sum()

np.int64(874)

In [55]:
df['facing'].fillna('not available',inplace=True)
df['facing'] = df['facing'].str.lower()

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['facing'].fillna('not available',inplace=True)


#### We create two new feature columns: *area* and *property_type*

In [56]:
df.insert(loc=4,column='area',value=round((df['price']*10000000)/df['pricePerSqft']))
df.insert(loc=1,column='property_type',value='flat')
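The factor 10000000 converts the price from crore to rupees (1 crore = 10,000,000), so dividing by the price per square foot gives the area in square feet. As a worked check, the sketch below recomputes the area for two price and rate pairs that appear in the sample rows; the constant name `RUPEES_PER_CRORE` is introduced here only for readability.

```python
import pandas as pd

RUPEES_PER_CRORE = 10_000_000

# Price (crore) and rate (rupees per sq.ft.) pairs from the sample output
sample = pd.DataFrame({"price": [1.4, 0.6], "pricePerSqft": [10000.0, 8000.0]})

# area (sq.ft.) = price in rupees / price per sq.ft.
sample["area"] = (sample["price"] * RUPEES_PER_CRORE / sample["pricePerSqft"]).round()
print(sample["area"].tolist())  # [1400.0, 750.0]
```

Both results match the carpet areas listed in `areaWithType` for those rows (1400 and 750 sq.ft.).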

In [57]:
df.sample(5)

Unnamed: 0,property_name,property_type,society,price,pricePerSqft,area,areaWithType,bedRoom,bathroom,balcony,additionalRoom,address,floorNum,facing,agePossession,nearbyLocations,description,furnishDetails,features,rating
215,2 BHK Flat in Sector 79 Gurgaon,flat,m3m golfestate,1.4,10000.0,1400.0,Carpet area: 1400 (130.06 sq.m.),2,2,2,"study room,servant room,store room","Sector 79 Gurgaon, Gurgaon, Haryana",4,east,,"['Petrol Pump Indian Oil', 'Petrol Pump', 'Rao...","Individual bar plus ground floor room , two si...","['1 Water Purifier', '1 Fan', '1 Fridge', '1 D...","['Intercom Facility', 'Lift(s)', 'High Ceiling...","['Environment4 out of 5', 'Safety4 out of 5', ..."
2298,2 BHK Flat in Sector 68 Gurgaon,flat,pareena mi casa,1.0,8163.0,1225.0,Carpet area: 1225 (113.81 sq.m.),2,2,2,servant room,"Sector 68 Gurgaon, Gurgaon, Haryana",16,not available,0 to 1 Year Old,"['Sector 55-56 Metro Station', 'Airia Mall', '...","Inner park facing, abundant sunlight, close pr...",,"['Feng Shui / Vaastu Compliant', 'Intercom Fac...","['Environment4 out of 5', 'Lifestyle4 out of 5..."
1154,3 BHK Flat in Sector 61 Gurgaon,flat,pioneer park,2.1,11666.0,1800.0,Super Built up area 1800(167.23 sq.m.)Carpet a...,3,3,3,not available,"Sector 61 Gurgaon, Gurgaon, Haryana",15,south-east,1 to 5 Year Old,"['Sector 55-56 Rapid Metro', 'Hong Kong Bazaar...",3bhk pioneer park sector 61 gurugram\n Additio...,"['3 Wardrobe', '1 Water Purifier', '4 Fan', '1...","['Feng Shui / Vaastu Compliant', 'Security / F...","['Green Area5 out of 5', 'Construction3 out of..."
2411,3 BHK Flat in Sector 86 Gurgaon,flat,pyramid urban homes 2,0.6,8000.0,750.0,Carpet area: 750 (69.68 sq.m.),3,2,2,not available,"Sector 86 Gurgaon, Gurgaon, Haryana",2,east,Within 6 months,"['Sapphire 83 Mall', 'Rampura Flyover, Naurang...",Best location with first open bolcony type of ...,,"['Feng Shui / Vaastu Compliant', 'Lift(s)', 'P...","['Environment5 out of 5', 'Lifestyle5 out of 5..."
2882,2 BHK Flat in Dayanand Colony,flat,divya apartments,0.6,7594.0,790.0,Built Up area: 790 (73.39 sq.m.)Carpet area: 7...,2,2,0,not available,"104,first Floor, Dayanand Colony, Gurgaon, Har...",2,south-west,1 to 5 Year Old,"['Chintapurni Mandir', 'Sheetla Mata Mandir', ...",We are the proud owners of this 2 bhk apartmen...,"['3 Wardrobe', '3 Fan', '1 Exhaust Fan', '6 Li...","['Lift(s)', 'Maintenance Staff', 'Recently Ren...","['Safety4 out of 5', 'Lifestyle5 out of 5', 'E..."


In [58]:
df.to_csv('cleaned_flats.csv',index=False)
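CSV files do not store data types, so it can be worth checking how the cleaned columns come back when the file is read again. The sketch below illustrates such a round trip on a small made-up frame, using an in-memory buffer in place of `cleaned_flats.csv`; the explicit `dtype` for `balcony` is an assumption about how one might keep that column as text.

```python
import io

import pandas as pd

# Toy frame mimicking a few cleaned columns
cleaned = pd.DataFrame(
    {
        "price": [1.4, 0.6],
        "bedRoom": [2, 3],
        "balcony": ["2", "3+"],
        "facing": ["east", "not available"],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Keep balcony as text so values like '3+' and '2' share one type
reloaded = pd.read_csv(buffer, dtype={"balcony": str})
print(reloaded.dtypes)
```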