## B. Data Exploration

In this phase, we will answer the following questions:
1. How many rows and how many columns?
2. What is the meaning of each row?
3. Are there duplicated rows?
4. What is the meaning of each column?
5. What is the current data type of each column? Are there columns having inappropriate data types?
6. With each numerical column, how are values distributed?
    - What is the percentage of missing values?
    - Min? max? Are they abnormal?
7. With each categorical column, how are values distributed?
    - What is the percentage of missing values?
    - How many different values? Show a few. Are they abnormal?

---
### Import necessary libraries

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [15]:
# Read data from csv file into Pandas dataframe
rented_house_df = pd.read_csv('../data/HCMHouseRentPreprocessing.csv', sep=',')
rented_house_df.head()

| | id | title | price | published | acreage | street | ward | district |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Cho thuê nhà trọ mới sạch đẹp tại Lê Đình Cẩn,... | 2200000 | 2022-05-16 | 20.0 | Lê Đình Cẩn | Phường Tân Tạo | Quận Bình Tân |
| 1 | 1 | Cho thuê phòng trọ giá rẻ ở mặt tiền hẻm lớn Đ... | 2500000 | 2022-04-20 | 20.0 | 487/35/25 Đường Huỳnh Tấn Phát | Phường Tân Thuận Đông | Quận 7 |
| 2 | 2 | Cho thuê phòng trọ kdc Nam Long-Trần Trọng Cun... | 3500000 | 2022-05-10 | 30.0 | Đường 10 | Phường Tân Thuận Đông | Quận 7 |
| 3 | 3 | Phòng trọ giá rẻ ngay cổng khu chế xuất Tân Th... | 1500000 | 2022-05-05 | 30.0 | 283/15 Huỳnh Tấn Phát | Phường Tân Thuận Đông | Quận 7 |
| 4 | 4 | Cho thuê phòng có gác, không gác, tolet riêng ... | 3500000 | 2022-01-05 | 18.0 | Lê Văn Sỹ | Phường 14 | Quận Phú Nhuận |


### 1. How many rows and how many columns?

In [16]:
num_rows = rented_house_df.shape[0]
num_cols = rented_house_df.shape[1]
print('Number of rows: ', num_rows)
print('Number of columns: ', num_cols)

Number of rows:  8948
Number of columns:  8


### 2. What is the meaning of each row?

Each row describes one rental listing in Ho Chi Minh City. It gives the listing's title, price, acreage in square meters, publication date, and address (street, ward, and district).


### 3. Are there duplicated rows?

In [17]:
# Count rows whose contents repeat an earlier row (index.duplicated would only compare index labels)
num_duplicated_rows = rented_house_df.duplicated(keep='first').sum()
num_duplicated_rows

0
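
When counting duplicates, it helps to remember that `Index.duplicated` compares only index labels, whereas `DataFrame.duplicated` compares the contents of each row. The sketch below illustrates the difference on a made-up frame; `toy_df` and its values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 have identical contents.
toy_df = pd.DataFrame({
    'price': [2200000, 2500000, 2200000],
    'district': ['Quận 7', 'Quận 3', 'Quận 7'],
})

# Compares index labels only; a default RangeIndex never repeats.
num_dup_index = toy_df.index.duplicated(keep='first').sum()

# Compares the values of every column, row by row.
num_dup_rows = toy_df.duplicated(keep='first').sum()

print('Duplicated index labels:', num_dup_index)
print('Duplicated rows:', num_dup_rows)
```

On a frame with a unique `id` column, both counts will be 0, since no two rows can then be fully identical.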

### 4. What is the meaning of each column?

Here is the description of the columns in the file "HCMHouseRentPreprocessing.csv":
- **id**: the identifier of the listing
- **title**: the title of the listing
- **price**: the price of the rented house (if the price is -1, the owner wants to discuss it further)
- **published**: the date the listing was published
- **acreage**: the acreage of the rented house in square meters
- **street**: the street of the rented house
- **ward**: the ward of the rented house
- **district**: the district of the rented house
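
Because a negative `price` acts as a sentinel for "negotiable" rather than a real amount, it may be worth masking such values before computing statistics. The following is only a sketch on hypothetical toy values. Treating negative prices as missing is an assumption made for illustration, and the name `toy_prices` does not come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices; -1 marks a "negotiable" listing.
toy_prices = pd.Series([2200000, -1, 3500000, -1, 1500000], name='price')

# Assumption: any negative price is the "negotiable" sentinel.
is_negotiable = toy_prices < 0

# Replace sentinel values with NaN so that min(), mean(), etc. skip them.
clean_prices = toy_prices.mask(is_negotiable, np.nan)

print('Negotiable listings:', is_negotiable.sum())
print('Minimum real price:', clean_prices.min())
```

Masking rather than dropping keeps the rows available for analyses that do not depend on the price.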
    

### 5. What is the current data type of each column? Are there columns having inappropriate data types?

In [18]:
col_dtypes = rented_house_df.dtypes
col_dtypes

id             int64
title         object
price          int64
published     object
acreage      float64
street        object
ward          object
district      object
dtype: object

**What do columns with the `object` dtype contain?**
There are 5 columns with the `object` dtype: "title", "published", "street", "ward", "district".

In [19]:
def open_object_dtype(s):
    # Return the set of Python types of the values stored in the Series
    return set(s.apply(type))

In [20]:
open_object_dtype(rented_house_df['published'])

{str}
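
An `object` column can hold values of several Python types at once, which is why inspecting the actual types is useful. As a rough illustration on a hypothetical mixed Series (the values below are made up), the same idea of mapping `type` over the values exposes every type present:

```python
import pandas as pd

# Hypothetical object-dtype Series mixing a string, an integer and a missing value.
mixed = pd.Series(['2022-05-16', 20, None], dtype=object)

# Map each value to its Python type and keep the distinct ones.
types_found = set(mixed.map(type))

print(types_found)
```

A result containing more than one type usually signals that the column needs cleaning before conversion.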

Issues that need preprocessing:

- The values in the "published" column are of type `str`. To explore this column further, we convert it to the `datetime` data type.

In [27]:
# Convert dtype of "published" column to datetime
rented_house_df['published'] = pd.to_datetime(rented_house_df['published'], format='%Y-%m-%d')
rented_house_df.head()


| | id | title | price | published | acreage | street | ward | district |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Cho thuê nhà trọ mới sạch đẹp tại Lê Đình Cẩn,... | 2200000 | 2022-05-16 | 20.0 | Lê Đình Cẩn | Phường Tân Tạo | Quận Bình Tân |
| 1 | 1 | Cho thuê phòng trọ giá rẻ ở mặt tiền hẻm lớn Đ... | 2500000 | 2022-04-20 | 20.0 | 487/35/25 Đường Huỳnh Tấn Phát | Phường Tân Thuận Đông | Quận 7 |
| 2 | 2 | Cho thuê phòng trọ kdc Nam Long-Trần Trọng Cun... | 3500000 | 2022-05-10 | 30.0 | Đường 10 | Phường Tân Thuận Đông | Quận 7 |
| 3 | 3 | Phòng trọ giá rẻ ngay cổng khu chế xuất Tân Th... | 1500000 | 2022-05-05 | 30.0 | 283/15 Huỳnh Tấn Phát | Phường Tân Thuận Đông | Quận 7 |
| 4 | 4 | Cho thuê phòng có gác, không gác, tolet riêng ... | 3500000 | 2022-01-05 | 18.0 | Lê Văn Sỹ | Phường 14 | Quận Phú Nhuận |


In [28]:
rented_house_df['published']

0      2022-05-16
1      2022-04-20
2      2022-05-10
3      2022-05-05
4      2022-01-05
          ...    
8943   2020-10-30
8944   2020-11-23
8945   2022-07-28
8946   2020-11-25
8947   2021-03-08
Name: published, Length: 8941, dtype: datetime64[ns]
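
A conversion like this can be sanity-checked by parsing with `errors='coerce'`, which turns unparseable strings into `NaT` instead of raising, and then counting the `NaT` values. The sketch below uses hypothetical toy strings, with one deliberately malformed entry, rather than the real column:

```python
import pandas as pd

# Hypothetical toy values; the last one is deliberately malformed.
raw_dates = pd.Series(['2022-05-16', '2022-04-20', 'not a date'])

# errors='coerce' yields NaT for strings that do not match the format.
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

# Count the values that failed to parse and show the covered date range.
num_unparsed = parsed.isna().sum()
print('Unparsed values:', num_unparsed)
print('Date range:', parsed.min(), 'to', parsed.max())
```

If the count is 0 on the real column, every string matched the expected format.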

### 6. With each numerical column, how are values distributed?

- What is the percentage of missing values?

In [22]:
num_cols_df = pd.DataFrame(columns=rented_house_df.columns.drop(['title', 'published', 'street', 'ward', 'district']))

num_missing_val = rented_house_df[num_cols_df.columns].isnull().sum()

num_cols_df.loc['missing_ratio'] = num_missing_val / num_rows * 100

num_cols_df


| | id | price | acreage |
|---|---|---|---|
| missing_ratio | 0.0 | 0.0 | 0.0 |
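
As an alternative to listing the non-numerical columns by hand, the numerical columns can be selected by dtype. The following is a minimal sketch on a hypothetical toy frame (`toy_df` and its values are made up); it relies on `select_dtypes` and on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing price and one missing district.
toy_df = pd.DataFrame({
    'price': [2200000, np.nan, 3500000, 1500000],
    'acreage': [20.0, 20.0, 30.0, 30.0],
    'district': ['Quận 7', 'Quận 7', None, 'Quận 3'],
})

# Keep only numerical columns, whatever their names are.
numeric_df = toy_df.select_dtypes(include='number')

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
missing_ratio = numeric_df.isna().mean() * 100

print(missing_ratio)
```

This avoids dividing by a separately stored row count, which can become stale once rows are filtered out.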


- Min? max? Are they abnormal?

In [23]:
rented_house_df.describe()

| | id | price | acreage |
|---|---|---|---|
| count | 8948.0 | 8948.0 | 8948.0 |
| mean | 4721.086835 | 3423750.0 | 26.326056 |
| std | 2721.921222 | 3388884.0 | 29.888507 |
| min | 0.0 | -1000000.0 | 0.0 |
| 25% | 2390.75 | 2300000.0 | 20.0 |
| 50% | 4708.5 | 3200000.0 | 25.0 |
| 75% | 7057.25 | 4000000.0 | 30.0 |
| max | 9501.0 | 150000000.0 | 1000.0 |


- The acreage of a rented house cannot be 0, so we remove the rows where it is.

In [24]:
# Remove the rows whose acreage is 0
rented_house_df = rented_house_df[rented_house_df['acreage'] != 0]

In [25]:
rented_house_df.describe()

| | id | price | acreage |
|---|---|---|---|
| count | 8941.0 | 8941.0 | 8941.0 |
| mean | 4721.301309 | 3425055.0 | 26.346667 |
| std | 2721.512619 | 3389699.0 | 29.891123 |
| min | 0.0 | -1000000.0 | 2.0 |
| 25% | 2392.0 | 2300000.0 | 20.0 |
| 50% | 4709.0 | 3200000.0 | 25.0 |
| 75% | 7056.0 | 4000000.0 | 30.0 |
| max | 9501.0 | 150000000.0 | 1000.0 |
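
To judge whether extreme values such as a very large maximum are abnormal, one common heuristic is the 1.5 × IQR rule. This is a swapped-in technique shown for illustration, not a step from the notebook, and the toy values below are hypothetical.

```python
import pandas as pd

# Hypothetical toy acreages with one extreme value.
toy_acreage = pd.Series([18.0, 20.0, 20.0, 25.0, 30.0, 30.0, 1000.0])

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = toy_acreage.quantile(0.25)
q3 = toy_acreage.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as candidates.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (toy_acreage < lower) | (toy_acreage > upper)

print('Bounds:', lower, upper)
print('Flagged values:', toy_acreage[is_outlier].tolist())
```

Flagged values are only candidates for review; a large acreage may still be a legitimate listing.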


### 7. With each categorical column, how are values distributed?

- What is the percentage of missing values?
- How many different values? Show a few. Are they abnormal?

In [26]:
cate_cols_df = pd.DataFrame(columns=rented_house_df.columns.drop(['price', 'acreage']))

num_missing_val = rented_house_df[cate_cols_df.columns].isnull().sum()

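# Divide by the current number of rows (num_rows was computed before rows were removed)
cate_cols_df.loc['missing_ratio'] = num_missing_val / len(rented_house_df) * 100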

cate_cols_df.loc['num_diff_vals'] = rented_house_df.apply(lambda x: x.nunique())

cate_cols_df.loc['diff_vals'] = rented_house_df.apply(lambda x: x[~x.isnull()].unique())

cate_cols_df

| | id | title | published | street | ward | district |
|---|---|---|---|---|---|---|
| missing_ratio | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| num_diff_vals | 8941.0 | 8935.0 | 1671.0 | 5628.0 | 166.0 | 24.0 |
| diff_vals | [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,... | [Cho thuê nhà trọ mới sạch đẹp tại Lê Đình Cẩn... | [2022-05-16T00:00:00.000000000, 2022-04-20T00:... | [Lê Đình Cẩn, 487/35/25 Đường Huỳnh Tấn Phát, ... | [Phường Tân Tạo, Phường Tân Thuận Đông, Phường... | [Quận Bình Tân, Quận 7, Quận Phú Nhuận, Quận 3... |
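
The number of distinct values does not show how often each one occurs. For a low-cardinality column such as "district", `value_counts` gives a fuller picture of the distribution. The sketch below uses a hypothetical toy Series rather than the real column:

```python
import pandas as pd

# Hypothetical toy district values.
toy_district = pd.Series(
    ['Quận 7', 'Quận 7', 'Quận Bình Tân', 'Quận 7',
     'Quận Phú Nhuận', 'Quận Bình Tân'],
    name='district',
)

# Absolute frequency of each category, most frequent first.
counts = toy_district.value_counts()

# Relative frequency as a percentage of all rows.
shares = toy_district.value_counts(normalize=True) * 100

print(counts)
print(shares.round(1))
```

Categories with very few occurrences are worth a closer look, since they are often spelling variants of a more common value.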
