# **Data Study Notebook**

## Objectives

- Answer Business Requirement 1:
The client is interested in discovering how house attributes correlate with the sale price.
- Generate correlation & visualization code for the Streamlit dashboard.


## Inputs

* outputs/datasets/cleaned/HousePricesCleaned.csv

## Hypotheses (from README & Notebook)

From README
* Hypothesis 1: Homes with larger overall living areas (GrLivArea), higher material quality (OverallQual), and newer construction years tend to have higher sale prices.
* Hypothesis 2: Features such as garage size (GarageArea), kitchen quality (KitchenQual), and the presence of finished basement areas significantly contribute to the value of a property.

Additional Null Hypotheses for Statistical Tests

- H₀₁ (OverallQual): There is no linear correlation between OverallQual and SalePrice.
- H₀₂ (BsmtExposure): The mean SalePrice is the same across different categories of BsmtExposure.

We will explore Hypotheses 1 & 2 visually and via regression modeling in later notebooks, and test H₀₁/H₀₂ here using Pearson/Spearman and ANOVA.


## Outputs

- Correlation matrices (Pearson, Spearman) with p‑values.
- PPS matrix.
- VIF table for multicollinearity.
- Histograms, scatter/regression plots, line plots, boxplots with interpretation comments.
- Statistical test results (p‑values) in markdown cells.


---

### Change Working Directory

In [None]:
import os

# Set working directory to project root
current_dir = os.getcwd()
os.chdir(os.path.dirname(current_dir))
print("Working directory set to", os.getcwd())

---

### Load Data

In [None]:
import pandas as pd

# Load cleaned dataset
df = pd.read_csv("outputs/datasets/cleaned/HousePricesCleaned.csv")
print(df.shape)
df.head()

## Data Profiling

In [None]:
from ydata_profiling import ProfileReport

# Minimal profile to understand distributions and types
test_profile = ProfileReport(df, minimal=True)
test_profile.to_notebook_iframe()

Interpretation: Use this report to verify that no missing values remain and to get a sense of each variable's skewness and type.

---

Create a new directory to save plots

In [None]:
import os

# Create the plots directory; exist_ok avoids an error if it already exists
os.makedirs("docs/plots", exist_ok=True)
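
The notebook creates `docs/plots`, but the cells shown here do not write any figures to it. A minimal sketch of a save helper follows; the name `save_plot`, the PNG format, and the `dpi` value are assumptions for illustration, not part of the original notebook. The demo writes to a temporary folder so it does not touch the project tree.

```python
import os
import tempfile

import matplotlib.pyplot as plt


def save_plot(fig, name, out_dir="docs/plots"):
    """Save a figure as PNG under out_dir and return the file path (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{name}.png")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    return path


# Demo with a throwaway figure and a temporary output folder
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title("Demo figure")
demo_dir = tempfile.mkdtemp()
saved_path = save_plot(fig, "demo_figure", out_dir=demo_dir)
plt.close(fig)
print(saved_path)
```

In the notebook itself, the same helper could be called with the default `out_dir` right before each `plt.show()`.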

In [None]:
import warnings

warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    module="ppscore.calculation"
)


---

## Correlation and PPS Analysis

In [None]:
import numpy as np
import ppscore as pps
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr

# Encode categorical variables
from feature_engine.imputation import CategoricalImputer
from feature_engine.encoding import OneHotEncoder
from sklearn.pipeline import Pipeline

cat_vars = df.select_dtypes(include="object").columns.tolist()
preproc = Pipeline(
    [
        (
            "impute_missing",
            CategoricalImputer(imputation_method="missing", variables=cat_vars),
        ),
        ("one_hot", OneHotEncoder(variables=cat_vars, drop_last=False)),
    ]
)
df_ohe = preproc.fit_transform(df)

# Calculate correlation matrices
corr_pearson = df_ohe.corr(method="pearson")
corr_spearman = df_ohe.corr(method="spearman")
# Calculate PPS matrix
pps_matrix = pps.matrix(df_ohe).pivot(index="y", columns="x", values="ppscore")


# Plot a heatmap that hides the upper triangle and every cell with |value| < thresh
def plot_heatmap(matrix, title, thresh=0.4):
    upper = np.triu(np.ones(matrix.shape, dtype=bool))
    mask = upper | (np.abs(matrix.to_numpy()) < thresh)
    plt.figure(figsize=(10, 8))
    sns.heatmap(matrix, mask=mask, annot=True, fmt=".2f", cmap="viridis")
    plt.title(title)
    plt.show()


# Display
plot_heatmap(corr_pearson, "Pearson Correlation (|r|>=0.4)")
plot_heatmap(corr_spearman, "Spearman Correlation (|ρ|>=0.4)")

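# PPS is asymmetric (x predicting y differs from y predicting x), so no triangle
# is hidden; only cells below the 0.2 threshold are masked
plt.figure(figsize=(10, 8))
sns.heatmap(pps_matrix, mask=pps_matrix < 0.2, annot=True, fmt=".2f", cmap="rocket_r")
plt.title("Predictive Power Score (PPS >= 0.2)")
plt.show()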

Interpretation: The strongest linear relationships with SalePrice are OverallQual (r≈0.79) and GrLivArea (r≈0.71). PPS also flags signals that need not be linear, e.g., the KitchenQual categories.
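
The Outputs section lists correlation coefficients with p‑values, while `DataFrame.corr` returns coefficients only. One possible way to pair each coefficient with its p‑value is sketched below on synthetic data; the helper name `corr_with_pvalues` and the demo columns are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr


def corr_with_pvalues(data, target):
    """Pearson and Spearman coefficients plus p-values of every column against target."""
    rows = []
    for col in data.columns.drop(target):
        r, p_r = pearsonr(data[col], data[target])
        rho, p_s = spearmanr(data[col], data[target])
        rows.append(
            {
                "feature": col,
                "pearson_r": r,
                "pearson_p": p_r,
                "spearman_rho": rho,
                "spearman_p": p_s,
            }
        )
    table = pd.DataFrame(rows)
    # Strongest absolute Pearson correlation first
    order = table["pearson_r"].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)


# Synthetic demo: one informative column, one pure-noise column
rng = np.random.default_rng(0)
signal = rng.normal(size=200)
demo = pd.DataFrame(
    {
        "strong": signal,
        "noise": rng.normal(size=200),
        "target": 3 * signal + rng.normal(scale=0.5, size=200),
    }
)
corr_table = corr_with_pvalues(demo, "target")
print(corr_table)
```

Applied to the notebook's data, this would be called on the numeric columns with `target="SalePrice"`, assuming no missing values remain.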

---

## Statistical Tests for Hypotheses

### Pearson & Spearman Tests for OverallQual vs SalePrice

In [None]:
# Extract arrays
y = df["SalePrice"]
x = df["OverallQual"]
# Pearson test
r, p_r = pearsonr(x, y)
# Spearman test
rho, p_s = spearmanr(x, y)
print(f"Pearson r={r:.3f}, p-value={p_r:.3e}")
print(f"Spearman ρ={rho:.3f}, p-value={p_s:.3e}")

Comment: Both p‑values are below 0.001, so H₀₁ is rejected: OverallQual and SalePrice show a significant linear (Pearson) and monotonic (Spearman) correlation.
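
Pearson measures linear association and Spearman measures monotonic association, so the two can diverge. A small synthetic illustration (not the housing data): for y = exp(x) the ranks agree perfectly, while a straight line fits imperfectly. The `_demo` names are placeholders chosen so that the notebook's `x` and `y` are not overwritten.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Strictly increasing but strongly curved relationship
x_demo = np.linspace(0, 5, 50)
y_demo = np.exp(x_demo)

r_demo, _ = pearsonr(x_demo, y_demo)
rho_demo, _ = spearmanr(x_demo, y_demo)
print(f"Pearson r={r_demo:.3f}, Spearman rho={rho_demo:.3f}")
```

When both coefficients are high, as reported for OverallQual, the relationship is both monotonic and reasonably close to linear.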

### ANOVA for BsmtExposure Categories

In [None]:
from scipy.stats import f_oneway

# Group SalePrice by BsmtExposure category
groups = [grp["SalePrice"].values for name, grp in df.groupby("BsmtExposure")]
f_stat, p_anova = f_oneway(*groups)
print(f"ANOVA F={f_stat:.2f}, p-value={p_anova:.3e}")

Comment: ANOVA p-value < 0.05 rejects H₀₂, indicating at least one category differs in mean SalePrice.
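
The ANOVA p‑value indicates that a difference exists, not how large it is, and the test assumes roughly normal residuals within each group. A hedged sketch that adds eta squared (the share of SalePrice variance explained by the grouping) and a rank-based Kruskal-Wallis check is shown below. The helper name `anova_with_effect_size` and the synthetic prices are illustrative assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway, kruskal


def anova_with_effect_size(data, group_col, value_col):
    """One-way ANOVA, Kruskal-Wallis, and eta squared for value_col across group_col."""
    groups = [g[value_col].to_numpy() for _, g in data.groupby(group_col)]
    f_stat, p_anova = f_oneway(*groups)
    h_stat, p_kruskal = kruskal(*groups)
    grand_mean = data[value_col].mean()
    # Eta squared = between-group sum of squares / total sum of squares
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((data[value_col] - grand_mean) ** 2).sum()
    return {
        "F": f_stat,
        "p_anova": p_anova,
        "H": h_stat,
        "p_kruskal": p_kruskal,
        "eta_squared": ss_between / ss_total,
    }


# Synthetic demo: three exposure levels with different mean prices
rng = np.random.default_rng(7)
demo = pd.DataFrame(
    {
        "BsmtExposure": np.repeat(["No", "Av", "Gd"], 100),
        "SalePrice": np.concatenate(
            [
                rng.normal(160_000, 30_000, 100),
                rng.normal(200_000, 30_000, 100),
                rng.normal(250_000, 30_000, 100),
            ]
        ),
    }
)
anova_result = anova_with_effect_size(demo, "BsmtExposure", "SalePrice")
print(anova_result)
```

If ANOVA and Kruskal-Wallis agree, the conclusion is less likely to hinge on the normality assumption.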

---

## Multicollinearity Check (VIF)

In [None]:
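from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Use the original numeric features only: the target is not a predictor, and the
# full one-hot set (drop_last=False) is perfectly collinear by construction
numeric_df = df.select_dtypes(include="number").drop(columns=["SalePrice"])


# Compute VIF per feature; the added constant acts as the regression intercept
def compute_vif(features):
    design = add_constant(features, has_constant="add")
    vif_data = pd.DataFrame()
    vif_data["feature"] = features.columns
    # Index 0 is the constant, so start at 1
    vif_data["VIF"] = [
        variance_inflation_factor(design.values, i)
        for i in range(1, design.shape[1])
    ]
    return vif_data.sort_values("VIF", ascending=False)


vif_table = compute_vif(numeric_df)
vif_table.head(10)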

Interpretation: High VIF (>10) for 1stFlrSF & GrLivArea suggests multicollinearity; consider dropping or combining.
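
For reference, the VIF of feature j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing feature j on the remaining features with an intercept. The same numbers appear on the diagonal of the inverse of the feature correlation matrix, which gives a compact cross-check. The sketch below uses synthetic columns; `vif_from_correlation` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(features):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns).sort_values(ascending=False)


# Synthetic demo: x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
demo = pd.DataFrame(
    {
        "x1": x1,
        "x2": x1 + 0.1 * rng.normal(size=500),
        "x3": rng.normal(size=500),
    }
)
vif_demo = vif_from_correlation(demo)
print(vif_demo)
```

The two near-duplicate columns receive a large VIF while the independent one stays close to 1, which mirrors the pattern described for the floor-area variables.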

---

## Univariate & Bivariate Visualizations with Comments

### SalePrice Distribution

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 4))
sns.histplot(df["SalePrice"], kde=True)
plt.title("SalePrice Distribution")
plt.xlabel("SalePrice")
plt.ylabel("Count")
plt.show()

Interpretation: The distribution is right-skewed (median below mean); consider a log transform later.
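
To judge whether a log transform would help, skewness can be compared before and after the transform. The sketch below uses synthetic log-normal values as a stand-in for SalePrice; the distribution parameters are assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (not the real dataset)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)
log_prices = np.log1p(prices)

skew_raw = skew(prices)
skew_log = skew(log_prices)
print(f"mean={prices.mean():.0f}, median={np.median(prices):.0f}")
print(f"skew raw={skew_raw:.2f}, skew after log1p={skew_log:.2f}")
```

A skewness near zero after the transform suggests the log scale is closer to symmetric.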

### Scatter Plots with Regression Line

In [None]:
# Key numerical features
top_feats = ["OverallQual", "GrLivArea", "GarageArea"]
for feat in top_feats:
    plt.figure(figsize=(6, 4))
    sns.regplot(data=df, x=feat, y="SalePrice", ci=None)
    plt.title(f"SalePrice vs {feat}")
    plt.annotate(
        f"r={corr_pearson.loc[feat,'SalePrice']:.2f}",
        xy=(0.05, 0.9),
        xycoords="axes fraction",
    )
    plt.show()

Interpretation: The high Pearson r values annotated on the plots confirm strong linear trends (e.g., OverallQual r≈0.79).
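
Pearson r and R² are related but not interchangeable: for a straight-line fit with a single predictor, R² equals r squared, so r≈0.79 corresponds to roughly 62% of variance explained. A quick numerical check on synthetic data (all names and values here are illustrative):

```python
import numpy as np

# Synthetic linear relationship with noise
rng = np.random.default_rng(1)
x_demo = rng.uniform(500, 3000, size=200)
y_demo = 100 * x_demo + rng.normal(scale=40_000, size=200)

# Pearson correlation coefficient
r_demo = np.corrcoef(x_demo, y_demo)[0, 1]

# R² of the least-squares line: 1 - residual variance / total variance
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
residuals = y_demo - (slope * x_demo + intercept)
r_squared = 1 - residuals.var() / y_demo.var()

print(f"r={r_demo:.3f}, r^2={r_demo**2:.3f}, R^2={r_squared:.3f}")
```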

### Line Plot for Year Variables

In [None]:
for year in ["YearBuilt", "YearRemodAdd"]:
    plt.figure(figsize=(6, 4))
    sns.lineplot(data=df, x=year, y="SalePrice")
    plt.title(f"SalePrice over {year}")
    plt.show()

Interpretation: Homes built after 1980 show steadily increasing average prices.
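
A line plot over individual years can be noisy when few homes share a build year. One way to check a trend such as the post-1980 rise is to average by decade. The sketch below uses made-up values, and the `Decade` column is a hypothetical helper, not a field in the dataset.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame(
    {
        "YearBuilt": [1962, 1968, 1985, 1989, 2003, 2007],
        "SalePrice": [120_000, 130_000, 160_000, 180_000, 230_000, 250_000],
    }
)

# Floor each year to the start of its decade, then average prices per decade
demo["Decade"] = (demo["YearBuilt"] // 10) * 10
decade_mean = demo.groupby("Decade")["SalePrice"].mean()
print(decade_mean)
```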

### Boxplots for Categorical Features

In [None]:
cat_to_plot = ["OverallQual", "KitchenQual", "BsmtExposure"]
for cat in cat_to_plot:
    plt.figure(figsize=(6, 4))
    sns.boxplot(data=df, x=cat, y="SalePrice")
    plt.title(f"SalePrice by {cat}")
    plt.show()

Interpretation:
- OverallQual: median SalePrice increases with quality rating.
- KitchenQual: Ex > Gd > TA in median price.
- BsmtExposure: the `No` (no exposure) category shows a lower median than the other categories, consistent with the ANOVA result.
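
The median comparisons above can also be read from a small table, and an ordered categorical keeps the quality levels in a meaningful order for both tables and boxplots. The sketch below uses made-up prices; the level order `Fa < TA < Gd < Ex` is an assumption about how the ratings rank and should be checked against the data dictionary.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame(
    {
        "KitchenQual": ["TA", "Gd", "Ex", "TA", "Gd", "Ex", "Fa", "Fa"],
        "SalePrice": [
            140_000, 200_000, 320_000, 150_000,
            210_000, 330_000, 100_000, 110_000,
        ],
    }
)

# Assumed ranking of the quality labels, worst to best
quality_order = ["Fa", "TA", "Gd", "Ex"]
demo["KitchenQual"] = pd.Categorical(
    demo["KitchenQual"], categories=quality_order, ordered=True
)

# Median price per level, listed in the assumed quality order
medians = demo.groupby("KitchenQual", observed=True)["SalePrice"].median()
print(medians)
```

With seaborn, the same ordering can be passed to `sns.boxplot` through its `order` argument.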

---

## Summary and Next Steps

**Summary**

- Statistically significant correlations confirmed (p<0.001).
- Multicollinearity flagged between floor area variables.
- Distributions and outliers identified for transformation in Feature Engineering.

**Next Step**:
Feature Engineering Notebook to create and validate new features for modeling.