# **Data Study Notebook**

## Objectives

- Answer Business Requirement 1:
The client is interested in discovering how house attributes correlate with the sale price.
- Generate correlation & visualization code for the Streamlit dashboard.


## Inputs

* outputs/datasets/collection/HousePrices.csv

## Outputs

* Generate code that answers business requirement 1 and can be used to build the Streamlit App.
* Create plots that show how the selected variables correlate with the sale price.


---

## Change Working Directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

We change the working directory to the parent of the current one using os.path.dirname() and os.chdir().

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

---

## Load Data

In [None]:
import pandas as pd

df = pd.read_csv("outputs/datasets/cleaned/HousePricesCleaned.csv")
df.head()

## Data Exploration

We start with a data profiling report to better understand the dataset. This helps us review variable types, distributions, missing values, and how each variable may relate to the target.

We use the ydata-profiling library to generate this report.

In [None]:
from ydata_profiling import ProfileReport
profile = ProfileReport(df, minimal=True)
profile.to_notebook_iframe()

The profiling report shows:
- Nine features contain missing values.
- EnclosedPorch and WoodDeckSF have the highest missing rates: 90.7% and 89.4%, respectively.
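
The missing rates reported above can also be checked directly with pandas, without opening the profiling report. The snippet below is a minimal sketch on a made-up toy frame (the values and the name `toy` are assumptions for illustration, not the project data); in the notebook the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the house-price data (values are made up).
toy = pd.DataFrame(
    {
        "GrLivArea": [1500, 1200, 1800, 2100],
        "EnclosedPorch": [np.nan, np.nan, np.nan, 40.0],
        "WoodDeckSF": [np.nan, 120.0, np.nan, np.nan],
    }
)

# Share of missing values per column, in percent, highest first.
missing_pct = toy.isna().mean().mul(100).round(1).sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```

Counting the rows of the filtered result gives the number of features with missing values, which is the figure the report summarizes.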

---

## Correlation and PPS Analysis

Our dataset includes four categorical variables stored as objects. We need to encode them to calculate correlation coefficients.
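
To illustrate what one-hot encoding does to a categorical column, here is a minimal sketch on a made-up toy frame. It uses `pandas.get_dummies` rather than the `feature_engine` encoder used in this notebook, purely to keep the example dependency-free; the values and the name `toy` are assumptions for illustration.

```python
import pandas as pd

# Toy data: one categorical column and the target (values are made up).
toy = pd.DataFrame(
    {
        "KitchenQual": ["Gd", "TA", "Ex", "TA"],
        "SalePrice": [208500, 181500, 310000, 140000],
    }
)

# Pick every non-numeric column as a candidate for encoding.
categorical_cols = toy.select_dtypes(exclude="number").columns.to_list()

# Each category becomes its own 0/1 column; numeric columns are left untouched.
toy_ohe = pd.get_dummies(toy, columns=categorical_cols, dtype=int)
print(toy_ohe)
```

The single `KitchenQual` column is replaced by one binary column per category, which is why the column count grows after encoding.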

In [None]:
from feature_engine.encoding import OneHotEncoder

encoder = OneHotEncoder(
    variables=df.columns[df.dtypes == "object"].to_list(), drop_last=False
)
df_ohe = encoder.fit_transform(df)
print(df_ohe.shape)
df_ohe.head(5)

- After encoding, the dataset expands to 37 columns, with each category represented as a separate binary column (0 or 1).
- Next, we define functions to calculate correlation values, generate heatmaps, and display them.
- Each heatmap is saved to the docs directory for use in project documentation.
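
Both Pearson and Spearman coefficients are calculated because they answer slightly different questions. The sketch below, on synthetic data (the column names `x` and `y_exp` are assumptions for illustration), shows a relationship that is perfectly monotonic but not linear: Spearman reports a perfect association while Pearson does not.

```python
import numpy as np
import pandas as pd

# Synthetic example: y grows exponentially with x (monotonic, not linear).
x = np.arange(1, 21)
toy = pd.DataFrame({"x": x, "y_exp": np.exp(x / 3.0)})

# Pearson measures linear association; Spearman measures rank (monotonic) association.
pearson = toy.corr(method="pearson").loc["x", "y_exp"]
spearman = toy.corr(method="spearman").loc["x", "y_exp"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients for a feature is a hint that its relationship with the target is monotonic but curved.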

In [None]:
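%matplotlib inline
import os

import matplotlib.pyplot as plt
import numpy as np
import ppscore as pps
import seaborn as sns

sns.set(style="whitegrid")

# Make sure the folder the plots are saved to exists
os.makedirs("docs/plots", exist_ok=True)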

def heatmap_corr(df, threshold, figsize=(20, 12), font_annot=8):
    """
    Create a heatmap of a correlation matrix, hiding the upper triangle
    and every cell whose absolute value is below the threshold.
    """
    if len(df.columns) > 1:
        # np.bool was removed in NumPy 1.24; the built-in bool is equivalent
        mask = np.zeros_like(df, dtype=bool)
        mask[np.triu_indices_from(mask)] = True
        mask[abs(df) < threshold] = True

        fig, axes = plt.subplots(figsize=figsize)
        sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                    mask=mask, cmap='viridis', annot_kws={"size": font_annot},
                    ax=axes, linewidth=0.5)
        axes.set_yticklabels(df.columns, rotation=0)
        plt.ylim(len(df.columns), 0)
        # Save heatmaps to docs folder
        if df.name == "corr_spearman":
            plt.savefig('docs/plots/heatmap_corr_spearman.png', bbox_inches='tight')
        else:
            plt.savefig('docs/plots/heatmap_corr_pearson.png', bbox_inches='tight')
        plt.show()


def heatmap_pps(df, threshold, figsize=(20, 12), font_annot=8):
    """
    Create a heatmap of the PPS matrix, hiding every cell whose
    score is below the threshold.
    """
    if len(df.columns) > 1:
        # np.bool was removed in NumPy 1.24; the built-in bool is equivalent
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < threshold] = True

        fig, ax = plt.subplots(figsize=figsize)
        ax = sns.heatmap(df, annot=True, xticklabels=True, yticklabels=True,
                         mask=mask, cmap='rocket_r', annot_kws={"size": font_annot},
                         linewidth=0.05, linecolor='grey')

        plt.ylim(len(df.columns), 0)
        # Save heatmap to docs folder
        plt.savefig('docs/plots/heatmap_pps.png', bbox_inches='tight')
        plt.show()


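def CalculateCorrAndPPS(df):
    """
    Calculate the Pearson and Spearman correlation matrices and the PPS matrix.
    """
    # Correlation is only defined for numeric columns. Selecting them
    # explicitly avoids the error pandas >= 2.0 raises on object columns.
    df_numeric = df.select_dtypes(include="number")
    df_corr_spearman = df_numeric.corr(method="spearman")
    df_corr_spearman.name = 'corr_spearman'
    df_corr_pearson = df_numeric.corr(method="pearson")
    df_corr_pearson.name = 'corr_pearson'

    # PPS handles categorical columns, so it runs on the full DataFrame
    pps_matrix_raw = pps.matrix(df)
    pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

    # Summary statistics of the scores (excluding each column against itself)
    pps_score_stats = pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T
    print("PPS threshold - check PPS score IQR to decide threshold for heatmap \n")
    print(pps_score_stats.round(3))

    return df_corr_pearson, df_corr_spearman, pps_matrix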


def DisplayCorrAndPPS(df_corr_pearson, df_corr_spearman, pps_matrix,CorrThreshold,PPS_Threshold,
                      figsize=(20,12), font_annot=8 ):
  """
  Function to display the correlations and pps.
  """

  print("\n")
  print("* Analyze how the target variable for your ML models is correlated with other variables (features and target)")
  print("* Analyze multicollinearity, that is, how the features are correlated among themselves")

  print("\n")
  print("*** Heatmap: Spearman Correlation ***")
  print("It evaluates monotonic relationship \n")
  heatmap_corr(df=df_corr_spearman, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Pearson Correlation ***")
  print("It evaluates the linear relationship between two continuous variables \n")
  heatmap_corr(df=df_corr_pearson, threshold=CorrThreshold, figsize=figsize, font_annot=font_annot)

  print("\n")
  print("*** Heatmap: Predictive Power Score (PPS) ***")
  print(f"PPS detects linear or non-linear relationships between two columns.\n"
        f"The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
  heatmap_pps(df=pps_matrix,threshold=PPS_Threshold, figsize=figsize, font_annot=font_annot)


In [None]:
df_corr_pearson, df_corr_spearman, pps_matrix = CalculateCorrAndPPS(df)

In [None]:
DisplayCorrAndPPS(
    df_corr_pearson=df_corr_pearson,
    df_corr_spearman=df_corr_spearman,
    pps_matrix=pps_matrix,
    CorrThreshold=0.4,
    PPS_Threshold=0.2,
    figsize=(12, 10),
    font_annot=10,
)

In [None]:
corr_spearman = (
    df_ohe.corr(method="spearman")["SalePrice"]
    .sort_values(key=abs, ascending=False)[1:]
    .head(10)
)
corr_spearman

In [None]:
corr_pearson = (
    df_ohe.corr(method="pearson")["SalePrice"]
    .sort_values(key=abs, ascending=False)[1:]
    .head(10)
)
corr_pearson

In [None]:
top_n = 10
set(corr_pearson[:top_n].index.to_list() + corr_spearman[:top_n].index.to_list())

In [None]:
vars_to_study = [
    "1stFlrSF",
    "GarageArea",
    "GrLivArea",
    "KitchenQual",
    "MasVnrArea",
    "OpenPorchSF",
    "OverallQual",
    "TotalBsmtSF",
    "YearBuilt",
    "YearRemodAdd",
]
vars_to_study

## Create EDA Dataset

We run EDA using features moderately or strongly correlated with SalePrice, creating a DataFrame that includes the target for visualization.
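
The feature selection behind this dataset, ranking features by absolute correlation with the target under both methods and taking the union, can be sketched as follows. The data is synthetic and the helper `top_correlated` is a hypothetical name introduced for illustration; it drops the target by label instead of slicing off the first row.

```python
import numpy as np
import pandas as pd

# Synthetic data: Price depends on Size and Quality, but not on Noise.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame(
    {
        "Size": rng.uniform(50, 250, n),
        "Quality": rng.integers(1, 11, n),
        "Noise": rng.normal(size=n),
    }
)
toy["Price"] = 1000 * toy["Size"] + 15000 * toy["Quality"] + rng.normal(0, 5000, n)


def top_correlated(df, target, method, top_n):
    """Return the top_n features ranked by absolute correlation with the target."""
    corr = df.corr(method=method)[target].drop(target)
    return corr.sort_values(key=abs, ascending=False).head(top_n)


# Union of the top features under both correlation methods.
top_pearson = top_correlated(toy, "Price", "pearson", top_n=2)
top_spearman = top_correlated(toy, "Price", "spearman", top_n=2)
selected = sorted(set(top_pearson.index) | set(top_spearman.index))
print(selected)
```

Taking the union keeps a feature that ranks highly under either method, so a strong but non-linear relationship is not dropped just because its Pearson coefficient is modest.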

In [None]:
df_eda = df[vars_to_study + ["SalePrice"]]
df_eda.head()

### Target Analysis

Distribution of the sale price.

In [None]:
sns.set_style("whitegrid")
target_var = "SalePrice"
time = ["YearBuilt", "YearRemodAdd"]


def plot_target_hist(df, target_var):
    """
    Function to plot a histogram of the target and
    save the figure to folder.
    """
    plt.figure(figsize=(12, 5))
    sns.histplot(data=df, x=target_var, kde=True)
    plt.title(f"Distribution of {target_var}", fontsize=20)
    plt.savefig(f"docs/plots/hist_plot_{target_var}.png", bbox_inches="tight")
    plt.show()


plot_target_hist(df, target_var)

### Bivariate plots

To address business requirement 1, we plot SalePrice against key numerical and categorical features to explore their relationships.

In [None]:
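def plot_lm(df, col, target_var):
    """
    Function to create linear regression plots of the target and
    features with continuous values.
    It saves each figure to folder.
    """
    # lmplot is a figure-level function that creates its own figure, so a
    # preceding plt.figure() would only produce an empty plot. The size is
    # set through height and aspect instead (5 x 2.4 = 12 inches wide).
    sns.lmplot(data=df, x=col, y=target_var, ci=None, height=5, aspect=2.4)
    plt.title(f"{col}", fontsize=20)
    plt.savefig(f"docs/plots/lm_plot_price_by_{col}.png", bbox_inches="tight")
    plt.show()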


def plot_line(df, col, target_var):
    """
    Function to create a line plot of the target and
    time variables (years).
    It saves each figure to folder.
    """
    plt.figure(figsize=(12, 5))
    sns.lineplot(data=df, x=col, y=target_var)
    plt.title(f"{col}", fontsize=20)
    plt.savefig(f"docs/plots/line_plot_price_by_{col}.png", bbox_inches="tight")
    plt.show()


def plot_box(df, col, target_var):
    """
    Function to create a box plot of the target and
    categorical variables.
    It saves each figure to folder.
    """
    plt.figure(figsize=(8, 5))
    sns.boxplot(data=df, x=col, y=target_var)
    plt.title(f"{col}", fontsize=20)
    plt.savefig(f"docs/plots/box_plot_price_by_{col}.png", bbox_inches="tight")
    plt.show()


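for col in vars_to_study:
    if len(df_eda[col].unique()) <= 10:
        # Few distinct values: treat the feature as categorical
        plot_box(df_eda, col, target_var)
    elif col in time:
        # Year features: show the price trend over time
        plot_line(df_eda, col, target_var)
    else:
        # Continuous features: scatter plot with a fitted regression line
        plot_lm(df_eda, col, target_var)
    print("\n\n")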

The plots show that higher feature values generally align with higher sale prices. However, outliers are present, similar to what we saw in the target variable.
We’ll address these outliers during feature engineering to prepare the data for modeling.
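
One common way to quantify the outliers seen in the plots is the 1.5 × IQR rule, which is also seaborn's default for box plot whiskers. The snippet below is a minimal sketch on made-up prices (the values are assumptions for illustration); it is not necessarily the method used in the Feature Engineering notebook.

```python
import pandas as pd

# Made-up sale prices with one very expensive house.
prices = pd.Series(
    [120000, 135000, 150000, 160000, 175000, 190000, 210000, 755000],
    name="SalePrice",
)

# Interquartile range and the 1.5 * IQR fences.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = prices[(prices < lower) | (prices > upper)]
print(f"Fences: [{lower:.0f}, {upper:.0f}]")
print(outliers)
```

Applied per feature, the same rule gives a quick count of how many observations fall outside the whiskers in each box plot.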

---

## Summary and Next Steps

**Summary**

The correlation analysis confirms:
- OverallQual, GrLivArea, and YearBuilt show a strong correlation with SalePrice.
- Variables related to size (e.g., 1stFlrSF, TotalBsmtSF) and quality (e.g., OverallQual) are the most predictive.
- From 1980 onwards, newer homes see significant price increases.

**Next Step**

Proceed to the Feature Engineering notebook.

