## **DATA CLEANING**

### **1. Importing data**

In [66]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.figsize"] = (15,8)

In [104]:
data_df = pd.read_csv(r'F:\DataMining\dataset\bengaluru_house_prices.csv')
data_df.head()

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | NaN | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.0 |


In [105]:
data_df.shape

(13320, 9)

In [106]:
data_df.columns

Index(['area_type', 'availability', 'location', 'size', 'society',
       'total_sqft', 'bath', 'balcony', 'price'],
      dtype='object')

In [107]:
data_df['area_type'].unique()

array(['Super built-up  Area', 'Plot  Area', 'Built-up  Area',
       'Carpet  Area'], dtype=object)

In [108]:
data_df['area_type'].value_counts()

area_type
Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: count, dtype: int64

In [110]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13246 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


**Drop features that are assumed not to be needed for building the model**

In [113]:
df4 = data_df.drop(['area_type','society','balcony','availability'],axis='columns')
df4.shape

(13320, 5)
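As a small illustration of the drop step above, the following sketch uses a two-row toy frame with the same nine column names and checks which columns remain. The values in `toy_df` and the names `toy_df`, `unused_cols`, and `trimmed_df` are made up for this example and are not part of the real dataset or notebook.

```python
import pandas as pd

# Toy frame with the same column names as the dataset; the values are made up.
toy_df = pd.DataFrame({
    "area_type": ["Plot  Area", "Built-up  Area"],
    "availability": ["Ready To Move", "Ready To Move"],
    "location": ["A", "B"],
    "size": ["2 BHK", "3 BHK"],
    "society": [None, "X"],
    "total_sqft": ["1000", "1500"],
    "bath": [2.0, 3.0],
    "balcony": [1.0, 2.0],
    "price": [50.0, 80.0],
})

unused_cols = ["area_type", "society", "balcony", "availability"]
# drop() returns a new frame; toy_df itself is left unchanged.
trimmed_df = toy_df.drop(columns=unused_cols)

remaining = list(trimmed_df.columns)
print(remaining)
```

Because `drop` returns a new frame, the original stays available in case one of the dropped features turns out to be useful later.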

In [115]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13319 non-null  object 
 1   size        13304 non-null  object 
 2   total_sqft  13320 non-null  object 
 3   bath        13246 non-null  float64
 4   price       13320 non-null  float64
dtypes: float64(2), object(3)
memory usage: 520.4+ KB


## **Data cleaning: Handling NA values**

In [103]:
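# Note: df3 (and df2 below) are not defined in this section; they have the same five columns as df4 and are presumably copies of it.
df3.isnull().sum()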

location       1
size           0
total_sqft     0
bath          74
price          0
dtype: int64

In [75]:
for col in df2.columns:
    missing_data = df2[col].isna().sum()
    missing_percentage = missing_data/len(df2) * 100
    print(f'{col} has {missing_percentage}% missing data')

location has 0.0075075075075075074% missing data
size has 0.12012012012012012% missing data
total_sqft has 0.0% missing data
bath has 0.5555555555555556% missing data
price has 0.0% missing data
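The per-column loop above can also be written without an explicit count, since the mean of a boolean mask is the fraction of `True` values. The sketch below shows this on a toy frame; `toy_df` and its values are hypothetical and only serve to make the example self-contained.

```python
import numpy as np
import pandas as pd

# Toy data (made up): 1 of 4 locations and 2 of 4 bath values are missing.
toy_df = pd.DataFrame({
    "location": ["A", None, "C", "D"],
    "bath": [2.0, np.nan, np.nan, 1.0],
    "price": [10.0, 20.0, 30.0, 40.0],
})

# isna() gives a boolean frame; the column-wise mean is the missing fraction.
missing_pct = toy_df.isna().mean() * 100

for col, pct in missing_pct.items():
    # Two decimals keep the report readable.
    print(f"{col} has {pct:.2f}% missing data")
```

Rounding only at print time keeps the underlying percentages exact if they are needed for a later threshold check.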


In [76]:
# Feature matrix: every column except the last one (price).
x = df2.iloc[:, :-1].values

In [77]:
x

array([['Electronic City Phase II', '2 BHK', '1056', nan],
       ['Chikka Tirupathi', '4 Bedroom', '2600', 5.0],
       ['Uttarahalli', '3 BHK', '1440', 2.0],
       ...,
       ['Raja Rajeshwari Nagar', '2 BHK', '1141', 2.0],
       ['Padmanabhanagar', '4 BHK', '4689', 4.0],
       ['Doddathoguru', '1 BHK', '550', 1.0]],
      shape=(13320, 4), dtype=object)

In [78]:
# Target vector: the last column (price).
y = df2.iloc[:,-1].values

In [79]:
y

array([ 39.07, 120.  ,  62.  , ...,  60.  , 488.  ,  17.  ],
      shape=(13320,))

In [80]:
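from sklearn.impute import SimpleImputer
# Replace missing bath counts (column 3 of x) with the column mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# SimpleImputer expects 2D input, so reshape to a single column and flatten the result.
x[:,3] = imputer.fit_transform(x[:,3].reshape(-1, 1)).ravel()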

In [97]:
x

array([['Electronic City Phase II', '2 BHK', '1056', 2.6926619356786956],
       ['Chikka Tirupathi', '4 Bedroom', '2600', 5.0],
       ['Uttarahalli', '3 BHK', '1440', 2.0],
       ...,
       ['Raja Rajeshwari Nagar', '2 BHK', '1141', 2.0],
       ['Padmanabhanagar', '4 BHK', '4689', 4.0],
       ['Doddathoguru', '1 BHK', '550', 1.0]],
      shape=(13320, 4), dtype=object)

In [91]:
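# Write the imputed bath values back; x is an object array, so the column ends up with dtype object.
df2.iloc[:,3] = x[:,3]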

In [92]:
df2.isnull().sum()

location      1
size          0
total_sqft    0
bath          0
price         0
dtype: int64

In [96]:
df2['bath'].unique()

array([2.6926619356786956, 5.0, 2.0, 3.0, 4.0, 6.0, 1.0, 9.0, 8.0, 7.0,
       11.0, 10.0, 14.0, 27.0, 12.0, 16.0, 40.0, 15.0, 13.0, 18.0],
      dtype=object)
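The `dtype=object` in the output above is a side effect of passing the imputed values through the object array `x`. One possible way to avoid this is to fill the column directly in the DataFrame. The sketch below swaps in pandas `fillna` with the column mean in place of `SimpleImputer`, and uses a hypothetical toy frame (`toy_df`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing bath value among four rows.
toy_df = pd.DataFrame({
    "size": ["2 BHK", "4 Bedroom", "3 BHK", "1 BHK"],
    "bath": [np.nan, 5.0, 2.0, 2.0],
})

# mean() skips NaN by default, so this is the mean of the observed values.
bath_mean = toy_df["bath"].mean()

# Fill in place of the missing entries; the column stays float64.
toy_df["bath"] = toy_df["bath"].fillna(bath_mean)

print(toy_df["bath"].dtype)
print(toy_df["bath"].tolist())
```

Keeping the column numeric means later steps such as `describe()` or filtering on the bathroom count work without an extra type conversion.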