## HDB Resale Flat Price Distribution 

### Aim: What is the distribution of HDB resale flat prices over the three years 2016 to 2018?

### Dataset

#### This dataset records resale flat transactions by registration date. Its variables are month, town, flat type, block, street name, storey range, floor area, flat model, lease commencement date, remaining lease period, and resale price.

#### Chart Type: Histogram

#### Source: https://data.gov.sg/dataset/resale-flat-prices

### Methodology

#### Step 1: Import the required libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt

#### Step 2: Import the required dataset

In [None]:
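### use a raw string so that the backslashes in the Windows path are not treated as escape sequences
filename = r'C:\Users\Jeffrey Wong\SP_Assignment_Python\HDB_resale_flat_prices.csv'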

data = np.genfromtxt(filename, skip_header = 1, dtype = [('month', 'U10'), ('town', 'U50'),
                                                        ('flat_type', 'U10'), ('block', 'U10'),
                                                        ('street_name', 'U50'), ('storey_range', 'U50'),
                                                        ('floor_area_sqm', 'i8'), ('flat_model', 'U50'),
                                                        ('lease_commence_date', 'U10'), ('remaining_lease', 'i8'),
                                                        ('resale_price', 'f8')], 
                     delimiter = ',', missing_values = ['na', '-', ''])
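To see how the structured `dtype` and the missing-value handling behave, the following minimal sketch loads a small in-memory CSV instead of the real file. The three-column layout and the sample rows are made up for illustration; only the `np.genfromtxt` call pattern mirrors the cell above.

```python
import io
import numpy as np

# Hypothetical three-column sample; the second row has an empty resale price.
csv_text = (
    "month,town,resale_price\n"
    "2016-01,ANG MO KIO,255000\n"
    "2016-01,BEDOK,\n"
    "2016-02,BISHAN,410000\n"
)

sample = np.genfromtxt(
    io.StringIO(csv_text),
    skip_header=1,          # skip the header row, as in the real import
    delimiter=',',
    dtype=[('month', 'U10'), ('town', 'U50'), ('resale_price', 'f8')],
)

# Each column is addressed by its field name.
print(sample['month'])
# The empty float field is read as NaN, which is why NaN filtering is needed later.
print(sample['resale_price'])
```

Because empty float fields come back as NaN rather than raising an error, any statistics computed on `resale_price` should be preceded by a NaN check.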

#### Step 3: Data Cleaning, Manipulation & Extraction

##### Use boolean indexing to locate the first record of the first month (January) of each year, and store each index in its own variable

In [None]:
def first_index_of_month(month):
    """Return the index of the first row whose 'month' field equals `month`."""
    matches = np.where(data['month'] == month)[0]
    if len(matches) == 0:
        ### fail early with a clear message instead of a NameError further down
        raise ValueError("No records found for month " + month)
    return matches[0]

### get the index of the first record for the first month of year 2016
position_2016 = first_index_of_month('2016-01')
print("The index of the first record for the first month of year 2016 is " + str(position_2016))

### get the index of the first record for the first month of year 2017
position_2017 = first_index_of_month('2017-01')
print("The index of the first record for the first month of year 2017 is " + str(position_2017))

### get the index of the first record for the first month of year 2018
position_2018 = first_index_of_month('2018-01')
print("The index of the first record for the first month of year 2018 is " + str(position_2018))

### get the index of the first record for the first month of year 2019
position_2019 = first_index_of_month('2019-01')
print("The index of the first record for the first month of year 2019 is " + str(position_2019))

##### Slice the data between consecutive index values to extract each year's records (this assumes the rows are sorted chronologically by month) and store each slice in its own variable

In [None]:
year_2016 = data[position_2016:position_2017]
year_2017 = data[position_2017:position_2018] 
year_2018 = data[position_2018:position_2019]

print(year_2016)
print()
print(year_2017)
print()
print(year_2018)
print()
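Slicing between index positions only works when the rows are sorted by month. As a hedged alternative, the sketch below selects a year with a boolean mask on the `month` prefix, which does not depend on row order. The `records` array and the helper name `rows_for_year` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical records, deliberately not in chronological order.
records = np.array(
    [('2017-03', 300000.0), ('2016-01', 250000.0),
     ('2016-11', 420000.0), ('2018-06', 510000.0)],
    dtype=[('month', 'U10'), ('resale_price', 'f8')],
)

def rows_for_year(data, year):
    """Return the rows whose 'month' field starts with the given year string."""
    mask = np.char.startswith(data['month'], year)
    return data[mask]

subset_2016 = rows_for_year(records, '2016')
print(subset_2016['resale_price'])
```

The trade-off is that the mask scans every row for each year, whereas slicing is a constant-time view once the positions are known.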

##### Extract the resale price values from the new dataset for each year

In [None]:
resale_price_2016 = year_2016['resale_price']
print("The resale prices for year 2016 are: ")
print(resale_price_2016)
print()

resale_price_2017 = year_2017['resale_price']
print("The resale prices for year 2017 are: ")
print(resale_price_2017)
print()

resale_price_2018 = year_2018['resale_price']
print("The resale prices for year 2018 are: ")
print(resale_price_2018)

#### Use the logical NOT operator, ~, on np.isnan() to get a boolean array that is True wherever the element is a valid number
##### Use this boolean array to index the original array and retrieve only the non-NaN values for years 2016, 2017 and 2018

In [None]:
new_resale_price_2016 = resale_price_2016[~np.isnan(resale_price_2016)]
new_resale_price_2017 = resale_price_2017[~np.isnan(resale_price_2017)]
new_resale_price_2018 = resale_price_2018[~np.isnan(resale_price_2018)]
combined_data = [new_resale_price_2016, new_resale_price_2017, new_resale_price_2018]
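The effect of the `~np.isnan(...)` mask is easiest to see on a tiny array. The values below are invented for illustration; the indexing pattern is the same as in the cell above.

```python
import numpy as np

# Hypothetical prices with two missing entries.
prices = np.array([250000.0, np.nan, 420000.0, np.nan, 310000.0])

is_missing = np.isnan(prices)     # True where the value is NaN
valid_prices = prices[~is_missing]  # keep only the entries that are not NaN

print(is_missing)
print(valid_prices)

# A single NaN makes the plain mean NaN, so filtering has to come first.
print(np.mean(prices), np.mean(valid_prices))
```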

#### Step 4: Data Visualization using Matplotlib

In [None]:

fig, ax = plt.subplots(nrows = 1, ncols = 3,figsize = (15,8))
    
facecolors = ['tomato', 'plum', 'chocolate']
labels  = ['Resale Price 2016', 'Resale Price 2017', 'Resale Price 2018']
title = ['Year 2016', 'Year 2017', 'Year 2018']
no_of_counts = []

for i in range(3):
    counts, bins, patches = ax[i].hist(combined_data[i], bins = 25, facecolor = facecolors[i], edgecolor = 'k', 
                                           histtype = 'bar', align = 'mid', density = False, label = labels[i])
    no_of_counts.append(counts)
    
    ax[i].set_title(title[i], fontsize = 15, fontweight = 'bold', color = 'k')
    
        
    ### add title and axes labels
    ax[i].set_xlabel('Resale Price (S$)', fontsize = 12, fontweight = 'bold')
    ax[i].set_ylabel('Number of Records', fontsize = 12, fontweight = 'bold')
        
    ### adjust both axes ticks values
    ax[i].tick_params(axis = "y", labelsize = 10, length = 10, width = 2.0, labelcolor = 'black', colors = 'red')
    ax[i].tick_params(axis = "x", labelsize = 10, length = 10, width = 2.0, labelcolor = 'black', colors = 'red', rotation = 45)
        
    ### removing top and right borders
    ax[i].spines['top'].set_visible(False)
    ax[i].spines['right'].set_visible(False)
        
    ### add minorticks 
    ax[i].minorticks_on()
        
    ### set y-axis limits
    ax[i].set_ylim(0,3000)
        
    ### set x-axis limits
    ax[i].set_xlim(20000, 1200000)
        
    ### add legend
    ax[i].legend(loc = 'lower center', fontsize = 12, edgecolor = 'navy', bbox_to_anchor = (0.5, -0.25),
                     ncol = 3, shadow = True)
        

### add suptitle 
fig.suptitle("HDB Resale Price Distribution, Year 2016, 2017 and 2018", fontsize = 15, fontweight = 'bold')
    
### save the image
fig.savefig('histogram.png')

plt.show()
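The `counts` and `bins` values returned by `ax.hist` can also be computed without drawing anything, using `np.histogram`. The sketch below uses made-up prices and only 4 bins (instead of 25) to show how the price range of the most populated bin could be recovered from those two arrays.

```python
import numpy as np

# Hypothetical resale prices in S$.
prices = np.array([210000, 250000, 260000, 270000,
                   300000, 310000, 480000, 900000], dtype=float)

# counts has one entry per bin; bin_edges has one more entry than counts.
counts, bin_edges = np.histogram(prices, bins=4)

# Index of the bin with the most records, and the price range it covers.
busiest = np.argmax(counts)
low, high = bin_edges[busiest], bin_edges[busiest + 1]

print("Counts per bin:", counts)
print("Most populated bin: S${:.0f} to S${:.0f} with {} records".format(
    low, high, counts[busiest]))
```

Pairing `np.argmax(counts)` with `bin_edges` gives the price range itself, whereas `counts.max()` alone only gives the number of records.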


#### Simple Text-Based Analysis using NumPy

In [None]:
print("***** HDB Resale Flat Price Distribution *****")
print()

##### display the number of rows in this dataset
no_of_rows = len(data)
print("The data corresponding to " + filename + " consists of " + str(no_of_rows) + " rows.")
print()

##### display the total number of rows of data extracted for year 2016, 2017, and 2018
print("How many rows of data corresponding to the following years?")
print("----------------------------------------------------------------------------")
no_of_rows_2016 = len(year_2016)
no_of_rows_2017 = len(year_2017)
no_of_rows_2018 = len(year_2018)
combined_no_of_rows = [no_of_rows_2016, no_of_rows_2017, no_of_rows_2018]
years = ['2016', '2017', '2018']
for i in range(len(combined_no_of_rows)):
    print("The data corresponding to the year " + years[i] + " consists of " + str(combined_no_of_rows[i]) + " rows.")
print()

##### compute the average resale price for the respective years from 2016 to 2018
average_resale_price = []
for j in range(len(years)):
    average_resale_price.append(np.mean(combined_data[j])) ### store the data in Python List
new_average_resale_price = np.array(average_resale_price) ### convert the Python List to Numpy Array

##### display the computed average resale price for the respective years from 2016 to 2018
print("Average HDB Resale Price for the following years: ")
print("----------------------------------------------------------------------------")
for resale_price in range(len(new_average_resale_price)):
    print("The average resale price for the year {} is ${:.2f}".format(years[resale_price], 
                                                                       new_average_resale_price[resale_price]))
print()

##### compute the median resale price for the respective years from 2016 to 2018
median_resale_price = []
for j in range(len(years)):
    median_resale_price.append(np.median(combined_data[j])) ### index with j, the loop variable of this loop
new_median_resale_price = np.array(median_resale_price) ### convert from Python List to Numpy Array

##### display the computed median resale price for the respective years from 2016 to 2018
print("Median HDB Resale Price for the following years: ")
print("----------------------------------------------------------------------------")
for resale_price in range(len(new_median_resale_price)):
    print("The median resale price for the year {} is ${:.2f}".format(years[resale_price], 
                                                                      new_median_resale_price[resale_price]))
print()


##### using mean and median to determine the shape of the histogram distribution for the respective years 2016 to 2018
print("Histogram Distribution Shape for the following years: ")
print("----------------------------------------------------------------------------")
for k in range(len(years)):
    if new_average_resale_price[k] == new_median_resale_price[k]:
        print("The resale price distribution for the year {} is symmetrical.".format(years[k]))
    elif new_average_resale_price[k] < new_median_resale_price[k]:
        print("The resale price distribution for the year {} is left-skewed.".format(years[k]))
    elif new_average_resale_price[k] > new_median_resale_price[k]:
        print("The resale price distribution for the year {} is right-skewed.".format(years[k]))
print()

##### display the maximum resale price for the respective years from 2016 to 2018
print("Maximum HDB Resale Price for the following years: ")
print("----------------------------------------------------------------------------")
for max_resale_price in range(len(combined_data)):
    max_value = combined_data[max_resale_price].max()
    print("The maximum resale price for the year {} is ${:.2f}".format(years[max_resale_price], max_value))
print()

##### display the minimum resale price for the respective years from 2016 to 2018
print("Minimum HDB Resale Price for the following years: ")
print("----------------------------------------------------------------------------")
for min_resale_price in range(len(combined_data)):
    min_value = combined_data[min_resale_price].min()
    print("The minimum resale price for the year {} is ${:.2f}".format(years[min_resale_price], min_value))
print()

##### convert the number of counts from Python Lists to Numpy Array
new_no_of_counts = np.array(no_of_counts)

##### display the highest bin count (number of records in the most populated price bin of the histogram) for each year
print("Highest number of resale records in a single price bin for the following years: ")
print("-------------------------------------------------------------------------------------------")
for counts in range(len(new_no_of_counts)):
    max_counts = new_no_of_counts[counts].max()
    print("The highest number of resale records for the year {} is {}".format(years[counts], max_counts))
print()

##### display the lowest bin count (number of records in the least populated price bin of the histogram) for each year
print("Lowest number of resale records in a single price bin for the following years: ")
print("-------------------------------------------------------------------------------------------")
for counts in range(len(new_no_of_counts)):
    min_counts = new_no_of_counts[counts].min()
    print("The lowest number of resale records for the year {} is {}".format(years[counts], min_counts))
print()
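Comparing the mean and median with an exact `==` will almost never report a symmetrical distribution for real prices, because two floating-point averages rarely match exactly. A possible refinement, sketched below, treats the two as equal when they differ by less than a relative tolerance. The helper name `describe_skew` and the 1% tolerance are assumptions chosen for illustration, and the mean-versus-median comparison remains a rule of thumb rather than a formal skewness test.

```python
import numpy as np

def describe_skew(values, rel_tol=0.01):
    """Classify a distribution by comparing its mean and median.

    rel_tol is a hypothetical tolerance: mean and median within 1% of
    each other are treated as equal.
    """
    mean = np.mean(values)
    median = np.median(values)
    if np.isclose(mean, median, rtol=rel_tol):
        return 'approximately symmetrical'
    return 'right-skewed' if mean > median else 'left-skewed'

# Hypothetical price samples in S$.
right_tail = np.array([200000.0, 250000.0, 300000.0, 350000.0, 900000.0])
balanced = np.array([200000.0, 300000.0, 400000.0])
left_tail = np.array([100000.0, 700000.0, 750000.0, 800000.0, 850000.0])

for name, sample in [('right_tail', right_tail),
                     ('balanced', balanced),
                     ('left_tail', left_tail)]:
    print(name, '->', describe_skew(sample))
```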