## **DATA CLEANING**

### **1. Importing data**

In [157]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.figsize"] = (15,8)

In [158]:
data_df = pd.read_csv(r'F:\DataMining\dataset\bengaluru_house_prices.csv')
data_df.head()

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | NaN | NaN | Coomee | 1056 | NaN | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |


In [159]:
data_df.shape

(13320, 9)

In [160]:
data_df.columns

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

In [161]:
data_df['area_type'].unique()

array(['Super built-up  Area', 'Plot  Area', 'Built-up  Area',
       'Carpet  Area'], dtype=object)

In [162]:
data_df['area_type'].value_counts()

area_type
Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: count, dtype: int64

In [163]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13318 non-null  object 
 3   size          13303 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13246 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


**Drop the features assumed not to be needed for building the model**

In [164]:
df4 = data_df.drop(['area_type','society','balcony','availability'],axis='columns')
df4.shape

(13320, 5)

In [165]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13318 non-null  object 
 1   size        13303 non-null  object 
 2   total_sqft  13320 non-null  object 
 3   bath        13246 non-null  float64
 4   price       13320 non-null  float64
dtypes: float64(2), object(3)
memory usage: 520.4+ KB


## **Data cleaning: Handling NA values**

In [166]:
df4.isnull().sum()

location       2
size          17
total_sqft     0
bath          74
price          0
dtype: int64

In [167]:
for col in df4.columns:
    missing_data = df4[col].isna().sum()
    missing_percentage = missing_data/len(df4) * 100
    print(f'{col} has {missing_percentage}% missing data')

location has 0.015015015015015015% missing data
size has 0.1276276276276276% missing data
total_sqft has 0.0% missing data
bath has 0.5555555555555556% missing data
price has 0.0% missing data
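
The per-column loop above can also be written as one vectorized expression, since the mean of a boolean column is the fraction of `True` values. A minimal sketch, assuming a small stand-in frame (`demo_df` and its values are hypothetical, not the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df4; the values are illustrative only.
demo_df = pd.DataFrame({
    "location": ["A", None, "B", "A"],
    "bath": [2.0, np.nan, np.nan, 1.0],
    "price": [39.07, 120.0, 62.0, 95.0],
})

# isna() yields a boolean frame; the column-wise mean of booleans
# is the share of missing entries, scaled here to a percentage.
missing_pct = (demo_df.isna().mean() * 100).round(2)
print(missing_pct)
```

Rounding to two decimals keeps the report readable; drop `.round(2)` if full precision is wanted.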


In [168]:
x = df4.iloc[:, :-1].values

In [169]:
x

array([[nan, nan, '1056', nan],
       ['Chikka Tirupathi', '4 Bedroom', '2600', 5.0],
       ['Uttarahalli', '3 BHK', '1440', 2.0],
       ...,
       ['Raja Rajeshwari Nagar', '2 BHK', '1141', 2.0],
       ['Padmanabhanagar', '4 BHK', '4689', 4.0],
       ['Doddathoguru', '1 BHK', '550', 1.0]],
      shape=(13320, 4), dtype=object)

In [170]:
y = df4.iloc[:,-1].values

In [171]:
y

array([ 39.07, 120.  ,  62.  , ...,  60.  , 488.  ,  17.  ],
      shape=(13320,))

In [172]:
from sklearn.impute import SimpleImputer
# Fill missing 'bath' values (column 3 of x) with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:,3] = imputer.fit_transform(x[:,3].reshape(-1, 1)).ravel()

In [173]:
x

array([[nan, nan, '1056', 2.6926619356786956],
       ['Chikka Tirupathi', '4 Bedroom', '2600', 5.0],
       ['Uttarahalli', '3 BHK', '1440', 2.0],
       ...,
       ['Raja Rajeshwari Nagar', '2 BHK', '1141', 2.0],
       ['Padmanabhanagar', '4 BHK', '4689', 4.0],
       ['Doddathoguru', '1 BHK', '550', 1.0]],
      shape=(13320, 4), dtype=object)

In [174]:
# Fill missing 'location' values (column 0 of x) with the most frequent location
imputer2 = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
x[:,0] = imputer2.fit_transform(x[:,0].reshape(-1, 1)).ravel()

In [175]:
x

array([['Whitefield', nan, '1056', 2.6926619356786956],
       ['Chikka Tirupathi', '4 Bedroom', '2600', 5.0],
       ['Uttarahalli', '3 BHK', '1440', 2.0],
       ...,
       ['Raja Rajeshwari Nagar', '2 BHK', '1141', 2.0],
       ['Padmanabhanagar', '4 BHK', '4689', 4.0],
       ['Doddathoguru', '1 BHK', '550', 1.0]],
      shape=(13320, 4), dtype=object)

In [177]:
# Fill missing 'size' values (column 1 of x) with the most frequent size
imputer3 = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
x[:,1] = imputer3.fit_transform(x[:,1].reshape(-1, 1)).ravel()

In [178]:
x

array([['Whitefield', '2 BHK', '1056', 2.6926619356786956],
       ['Chikka Tirupathi', '4 Bedroom', '2600', 5.0],
       ['Uttarahalli', '3 BHK', '1440', 2.0],
       ...,
       ['Raja Rajeshwari Nagar', '2 BHK', '1141', 2.0],
       ['Padmanabhanagar', '4 BHK', '4689', 4.0],
       ['Doddathoguru', '1 BHK', '550', 1.0]],
      shape=(13320, 4), dtype=object)

In [None]:
df4.isnull().sum()  # counts are unchanged because the imputed values in x have not been written back to df4 yet

location       2
size          17
total_sqft     0
bath          74
price          0
dtype: int64
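
The counts stay the same because `x` is a separate NumPy array: for a frame with mixed column dtypes, `.values` builds a new object array, so edits to it do not reach the DataFrame. A minimal sketch of that behaviour, assuming a hypothetical two-column frame (`demo_df` is illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame with one missing value per column.
demo_df = pd.DataFrame({"size": ["2 BHK", None], "bath": [2.0, np.nan]})

# Mixed dtypes force pandas to build a fresh object array (a copy).
arr = demo_df.values
arr[1, 0] = "2 BHK"   # fill the gaps in the array only
arr[1, 1] = 2.0

# The DataFrame still reports its original missing values.
remaining = int(demo_df.isna().sum().sum())
print(remaining)
```

The array has to be assigned back to the frame explicitly before `isnull()` reflects the imputation.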

In [183]:
# Write the imputed columns (location, size, bath) back into df4
df4.iloc[:,0] = x[:,0]
df4.iloc[:,1] = x[:,1]
df4.iloc[:, 3] = x[:, 3].astype(float)

In [184]:
df4.isnull().sum()

location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64
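
The same result (mean for `bath`, most frequent value for `location` and `size`) can also be reached directly on the DataFrame with `fillna`, which avoids the round trip through a NumPy array. This is a sketch of an alternative, not the approach used in the cells above, and `demo_df` is a hypothetical stand-in for `df4`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df4; the values are illustrative only.
demo_df = pd.DataFrame({
    "location": ["Whitefield", None, "Whitefield", "Kothanur"],
    "size": ["2 BHK", "2 BHK", None, "3 BHK"],
    "bath": [2.0, np.nan, 4.0, 3.0],
})

# Numeric column: replace NaN with the mean of the observed values.
demo_df["bath"] = demo_df["bath"].fillna(demo_df["bath"].mean())

# Categorical columns: replace NaN with the most frequent value (the mode).
for col in ["location", "size"]:
    demo_df[col] = demo_df[col].fillna(demo_df[col].mode()[0])

print(demo_df.isna().sum())
```

Because each column is updated in place on the frame, no separate write-back step is needed.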

In [185]:
df4.head()

| | location | size | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | Whitefield | 2 BHK | 1056 | 2.692662 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.0 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.0 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.0 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.0 |


## **Feature Engineering**

**Add a new integer feature for bhk (Bedrooms, Hall, Kitchen)**