# Data Visualisations Notebook

## Objectives

* Answer business requirement 1 
    * Second iteration of Data Visualisations
    * Identify the features most strongly correlated with sale price, including originally categorical data that was excluded from the correlation study in version 1 of this notebook.
    * Generate data visualisations of correlated features against sale price.

## Inputs

* outputs/datasets/cleaned/house_prices_records_cleaned.csv

## Outputs

* Generate code that answers business requirement 1 by identifying the variables most strongly correlated with sale price.
* Provide data visualisations including:
    * Heatmaps from correlation studies.
    * Scatter plots for most important house attributes vs. SalePrice.
    * Bar plots showing which attributes are most important in correlation studies.
    * Pie charts showing the importance of house attributes for predicting SalePrice based on correlation studies.
* Draw conclusions on which house attributes are most correlated with SalePrice.

## Conclusions

* The following features have a strong positive correlation with sale price: {'1stFlrSF', 'GarageArea', 'GrLivArea', 'OverallQual', 'TotalBsmtSF', 'YearBuilt', 'YearRemodAdd'}

## Additional Comments
This version uses a cleaned dataset as its input, which allows data visualisations to be performed on all variables (except the three that were dropped during data cleaning). This gives more comprehensive insights for answering the client's business requirement 1.


---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Load Data

In [None]:
import pandas as pd
df = pd.read_csv("outputs/datasets/cleaned/house_prices_records_cleaned.csv")
df.head(3)

---

# Data Exploration

Generate a pandas profiling report to gain insights into the dataset, including data types, missing data and distributions.

In [None]:
from ydata_profiling import ProfileReport
pandas_report = ProfileReport(df=df, minimal=True)
pandas_report.to_notebook_iframe()

Observations from pandas profile report
* The dataset has 21 variables (columns) and 1460 observations (rows).
* The dataset has only numeric variables.
* Sale price has a positively skewed distribution, with the middle 50% of observations (the interquartile range) between $130k and $214k. There is limited data above $350k, so it will be difficult to build a model that accurately predicts sale prices in that range. This is a potential limitation of the model that would need to be discussed with the client if any of their four inherited houses fall in this range.
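
The skew and interquartile range reported by the profile can also be checked directly with pandas. The sketch below uses synthetic log-normal prices (`toy_prices` is made up for illustration and is not the SalePrice column). The same calls should apply to `df['SalePrice']`, assuming that column is loaded.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices (hypothetical data, not the real SalePrice column)
rng = np.random.default_rng(seed=0)
toy_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000),
                       name="toy_price")

# Positive skewness indicates a long tail towards high prices
skewness = toy_prices.skew()

# Interquartile range: the span covering the middle 50% of observations
q1, q3 = toy_prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Fraction of observations above a high-price cut-off
share_above_350k = (toy_prices > 350_000).mean()

print(f"skewness: {skewness:.2f}")
print(f"IQR: {q1:,.0f} to {q3:,.0f}")
print(f"share above 350k: {share_above_350k:.1%}")
```

A small share above the cut-off means the model sees few examples in that range, which is the limitation noted in the observations.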

---

# Detailed Correlation Study

* We use the `.corr()` method with `spearman` to determine the top 10 variables most strongly correlated to SalePrice.
* The list is sliced at index 1 to exclude the first element because we don't want to include SalePrice v SalePrice.

In [None]:
corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10) 
corr_spearman  

* We use the `.corr()` method with `pearson` to determine the top 10 variables most strongly correlated to SalePrice.
* The list is sliced at index 1 to exclude the first element because we don't want to include SalePrice v SalePrice.

In [None]:
corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:].head(10) 
corr_pearson
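
Spearman and Pearson can rank variables differently because they measure different things. As a toy illustration (the `toy` dataframe is hypothetical and unrelated to the house prices data), a relationship that is monotonic but not linear gives a perfect Spearman coefficient and a lower Pearson coefficient:

```python
import numpy as np
import pandas as pd

# Toy data: y increases with x at every step, but along a curve rather than a line
toy = pd.DataFrame({"x": np.arange(1, 21, dtype=float)})
toy["y"] = np.exp(toy["x"] / 3)

# Pearson measures linear association; Spearman measures monotonic association
pearson_r = toy.corr(method="pearson").loc["x", "y"]
spearman_r = toy.corr(method="spearman").loc["x", "y"]

print(f"Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}")
```

This is why the study takes the strongly correlated variables from both methods rather than relying on only one.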

* Moderate to very strong correlations were noticed with both Spearman and Pearson methods.
* We will consider only the variables that are strongly correlated with a value > 0.5 or < -0.5.
* We concatenate the strong variables list from spearman and pearson correlations. 'set' ensures each variable appears only once in the new list.

In [None]:
pearson_variables_strong = corr_pearson[(corr_pearson > 0.5) | (corr_pearson < -0.5)].index.tolist()
spearman_variables_strong = corr_spearman[(corr_spearman > 0.5) | (corr_spearman < -0.5)].index.tolist()

variables_with_strong_correlation = set(pearson_variables_strong + spearman_variables_strong)
variables_with_strong_correlation

We now have seven of the most important variables for predicting house sale price, all of which are strongly correlated to SalePrice. These are listed in alphabetical order, not order of importance.
* First floor area in square feet
* Garage area in square feet
* Above grade (ground) living area in square feet
* Overall quality of materials and finishes
* Total basement area in square feet
* Original construction date
* Year of remodelling (or build if it has not been remodelled)

In [None]:
vars_to_study = ['1stFlrSF',
                'GarageArea',
                'GrLivArea',
                'OverallQual',
                'TotalBsmtSF',
                'YearBuilt',
                'YearRemodAdd'] 
vars_to_study

## Detailed Correlation and PPS Analysis

The following code, taken from the CI Churnometer project, visualises the correlation and Predictive Power Score (PPS) matrices of the dataframe as follows:
* heatmap_corr: generates a heatmap to visualise the correlation matrix of the df. It masks values below a certain threshold and annotates the heatmap with correlation coefficients.
* heatmap_pps: generates a heatmap to visualise the PPS matrix of the df. It masks values below a certain threshold and annotates the heatmap with PPS scores.
* CalculateCorrAndPPS: calculates the Pearson and Spearman correlation matrices, as well as the PPS matrix of the input df.
* DisplayCorrAndPPS: displays the correlation matrices (Pearson and Spearman) and the PPS matrix using heatmaps.
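
As a minimal sketch of the masking idea used by heatmap_corr, the example below builds the same kind of boolean mask on a made-up 3x3 correlation matrix (`toy_corr` and its values are hypothetical). Cells where the mask is True are hidden, so only lower-triangle coefficients with an absolute value at or above the threshold remain visible.

```python
import numpy as np
import pandas as pd

# Hypothetical correlation matrix for three made-up features
labels = ["A", "B", "C"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.5],
     [0.1, -0.5, 1.0]],
    index=labels, columns=labels)

threshold = 0.4

# Start with nothing masked, then hide the upper triangle (including the
# diagonal) and every coefficient whose absolute value is below the threshold
mask = np.zeros(toy_corr.shape, dtype=bool)
mask[np.triu_indices_from(mask)] = True
mask[np.abs(toy_corr.to_numpy()) < threshold] = True

# (row, column) pairs that would still be drawn on the heatmap
visible = [(labels[i], labels[j]) for i, j in zip(*np.where(~mask))]
print(visible)
```

Hiding the upper triangle removes the duplicate of each pair, since a correlation matrix is symmetric.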

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ppscore as pps


def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot}, ax=axes,
                    linewidth=0.5
                    )
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True
        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')
        plt.ylim(len(df.columns), 0)
        plt.show()


def CalculateCorrAndPPS(df):
    df_corr_spearman = df.corr(method="spearman")
    df_corr_pearson = df.corr(method="pearson")

    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix, CorrThreshold, PPS_Threshold,
                      figsize=(20, 12), font_annot=8):

    print("\n")
    print("* Analyse how the target variable for your ML models is correlated with other variables (features and target)")
    print("* Analyse multicollinearity, that is, how the features are correlated among themselves")

    print("\n")
    print("*** Heatmap: Spearman Correlation ***")
    print("It evaluates monotonic relationship \n")
    heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Pearson Correlation ***")
    print("It evaluates the linear relationship between two continuous variables \n")
    heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

    print("\n")
    print("*** Heatmap: Predictive Power Score (PPS) ***")
    print("PPS detects linear or non-linear relationships between two columns.\n"
          "The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
    heatmap_pps(df=pps_matrix, threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)

Calculate Correlations and Predictive Power Score

In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

* The PPS interquartile range shows that the middle 50% of scores lie between 0 and 0.028, which indicates that most feature pairs have little or no predictive power for one another.
* A maximum PPS of 0.579 suggests that at least one feature has a moderate predictive association with another. Note that this figure is lower than the original PPS (in 02_Data_Cleaning_v2) because the feature 'GarageYrBlt', which was strongly correlated with 'YearBuilt', has been dropped.
* A threshold of 0.2 was chosen to display features with low to moderate correlation or predictive power.
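
To make the reshaping and thresholding steps concrete, here is a small sketch using pandas only. The scores in `toy_long` are made up and merely mimic the `x`, `y` and `ppscore` columns that CalculateCorrAndPPS filters from the PPS output. No real PPS values are computed here.

```python
import pandas as pd

# Hypothetical long-format scores: one row per (x, y) pair
toy_long = pd.DataFrame({
    "x": ["A", "A", "B", "B"],
    "y": ["A", "B", "A", "B"],
    "ppscore": [1.0, 0.30, 0.05, 1.0],
})

# Pivot to a wide matrix, as done before plotting the heatmap
toy_wide = toy_long.pivot(columns="x", index="y", values="ppscore")

# Summary statistics exclude the trivial self-prediction scores of 1
stats = toy_long.query("ppscore < 1")["ppscore"].describe()

# Scores below the threshold become NaN, mirroring the cells the heatmap hides
threshold = 0.2
shown = toy_wide.where(toy_wide.abs() >= threshold)

print(stats.round(3))
print(shown)
```

Comparing the quartiles in `stats` with the threshold gives a rough idea of how many cells will remain on the heatmap.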

Generate heatmaps

In [None]:
DisplayCorrAndPPS(df_corr_pearson = df_corr_pearson,
                  df_corr_spearman = df_corr_spearman, 
                  pps_matrix = pps_matrix,
                  CorrThreshold = 0.4, PPS_Threshold =0.2,
                  figsize=(12,10), font_annot=10)

Observations from heatmap analyses:
* Sale price has a moderate to strong monotonic relationship (Spearman correlation) with 11 house attributes.
* Sale price has a moderate to strong linear relationship (Pearson correlation) with 8 house attributes.
* OverallQual and GrLivArea have the strongest predictive power for the target variable.
* Moderate multicollinearity exists between several features, as expected.
* We accept the detailed correlation study as being appropriate to generate data visualisations for the client with cleaned data.

# EDA on selected variables

Create a new dataframe with only the selected variables and SalePrice for exploratory data analysis (EDA).

In [None]:
df_eda = df.filter(vars_to_study + ['SalePrice']) 
df_eda.head(3)

## Variables Distribution by Sale Price

* We plot the variables of interest against the SalePrice
* We use a linear regression model to add a line of best fit to the data
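
A line of best fit for a single variable is a least-squares straight line. The sketch below computes one with `np.polyfit` on made-up points (`x` and `y` are hypothetical and not taken from the dataset), which is one simple way to obtain the slope and intercept of such a line.

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is a least-squares straight line
slope, intercept = np.polyfit(x, y, deg=1)

# Values of the fitted line at each x
fitted = slope * x + intercept

print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

With real data the points scatter around the line, and the slope indicates how much the target changes per unit of the feature.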

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')


def plot_numerical(df_eda, cols, target_var='SalePrice'):
    """Scatter plots of each variable vs SalePrice with a regression line."""
    for col in cols:
        fig, axes = plt.subplots(figsize=(8, 5))
        # regplot draws the scatter points plus a linear line of best fit
        sns.regplot(data=df_eda, x=col, y=target_var, ax=axes,
                    line_kws={'color': 'red'})
        plt.title(f"{col}", fontsize=20, y=1.05)
        plt.show()
        print("\n\n")


plot_numerical(df_eda, vars_to_study)




## Generate bar plots based on most important features from correlation study

In [None]:
def display_spearman_corr_bar(df):
    """ Calculate and display Spearman correlation with SalePrice """
    corr_spearman = df.corr(method='spearman')['SalePrice'].sort_values(
        key=abs, ascending=False)[1:]
    fig, axes = plt.subplots(figsize=(6, 3))
    # Plot the five attributes with the largest absolute coefficients
    axes.bar(x=corr_spearman[:5].index, height=corr_spearman[:5])
    plt.title(
        "Spearman Correlation of Attributes with Sale Price",
        fontsize=15, y=1.05
        )
    plt.xticks(rotation=90)
    plt.ylabel("Spearman Coefficient")
    plt.show()

display_spearman_corr_bar(df)

In [None]:
def display_pearson_corr_bar(df):
    """ Calculate and display Pearson correlation with SalePrice """
    corr_pearson = df.corr(method='pearson')['SalePrice'].sort_values(
        key=abs, ascending=False)[1:]
    fig, axes = plt.subplots(figsize=(6, 3))
    # Plot the five attributes with the largest absolute coefficients
    axes.bar(x=corr_pearson[:5].index, height=corr_pearson[:5])
    plt.title(
        "Pearson Correlation of Attributes with Sale Price",
        fontsize=15, y=1.05
        )
    plt.xticks(rotation=90)
    plt.ylabel("Pearson Coefficient")
    plt.show()

display_pearson_corr_bar(df)

## Generate pie charts showing relative importance of house attributes for predicting SalePrice

Generate dataframe from Spearman coefficients

In [None]:
corr_spearman_for_pie = df.corr(method='spearman')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_spearman_for_pie

In [None]:
# Reset index to make 'Feature' a column
spearman_data = corr_spearman_for_pie.reset_index() 

# Rename columns
spearman_data.columns = ['Feature', 'Correlation']

# Convert to DataFrame
df_corr_spearman = pd.DataFrame(spearman_data) 

print(df_corr_spearman)

Generate pie chart from Spearman coefficients dataframe

In [None]:
# Take the absolute value of the correlation values
df_corr_spearman['Correlation'] = df_corr_spearman['Correlation'].abs()

# Normalize the correlation values
df_corr_spearman['Normalized_Correlation'] = df_corr_spearman['Correlation'] / df_corr_spearman['Correlation'].sum()

# Plot a pie chart
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.pie(df_corr_spearman['Normalized_Correlation'], labels=df_corr_spearman['Feature'], autopct='%1.1f%%')
plt.title('Importance of House Attributes for Predicting Sale Price based on Spearman Correlations')
plt.show()
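
The normalisation behind the pie chart is simple arithmetic: each absolute coefficient is divided by the sum of all absolute coefficients, so the slices add up to 100%. A toy sketch with made-up coefficients (`toy_corr` is hypothetical):

```python
import pandas as pd

# Hypothetical correlation coefficients for three made-up features
toy_corr = pd.Series({"A": 0.8, "B": -0.4, "C": 0.4})

# A negative coefficient contributes its magnitude, not its sign
abs_corr = toy_corr.abs()

# Each share is the feature's fraction of the total absolute correlation
shares = abs_corr / abs_corr.sum()

print(shares.round(2))
```

Because each share is relative to the total, the percentages depend on which features are included in the chart.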


Generate dataframe from Pearson coefficients

In [None]:
corr_pearson_for_pie = df.corr(method='pearson')['SalePrice'].sort_values(key=abs, ascending=False)[1:]
corr_pearson_for_pie

In [None]:
pearson_data = corr_pearson_for_pie.reset_index() 

pearson_data.columns = ['Feature', 'Correlation']

df_corr_pearson = pd.DataFrame(pearson_data) 

print(df_corr_pearson)

Generate pie chart from Pearson coefficients dataframe

In [None]:
df_corr_pearson['Correlation'] = df_corr_pearson['Correlation'].abs()


df_corr_pearson['Normalized_Correlation'] = df_corr_pearson['Correlation'] / df_corr_pearson['Correlation'].sum()

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 8))
plt.pie(df_corr_pearson['Normalized_Correlation'], labels=df_corr_pearson['Feature'], autopct='%1.1f%%')
plt.title('Importance of House Attributes for Predicting Sale Price based on Pearson Correlations')
plt.show()

# Conclusions

* Correlation heat maps have been generated with cleaned data. 
* Scatter plots have been made to allow the client to visualise how the most important house attributes correlate with sale price.
* Bar plots have been generated to show most correlated features from correlation studies.
* Pie charts have been generated to show the client which attributes contribute most strongly to sale price.

* The following variables are most important and have a moderate or strong positive correlation with sale price.
    * First floor area in square feet
    * Garage area in square feet
    * Above grade (ground) living area in square feet
    * Overall quality of materials and finishes
    * Total basement area in square feet
    * Original construction date
    * Year of remodelling (or build if it has not been remodelled)


## Next Steps
* Feature engineering - identify transformations to apply to specific variables, in preparation for ML modelling.