# Project Final Report

### Due: Midnight on April 27 (2-hour grace period) — 50 points  

### No late submissions will be accepted.


## Overview

Your final submission consists of **three components**:

---

### 1. Final Report Notebook [40 pts]

Complete all sections of this notebook to document your final decisions, results, and broader context.

- **Part A**: Select the single best model from your Milestone 2 experiments. Now that you’ve finalized your model, revisit your decisions from Milestones 1 and 2. Are there any steps you would change—such as cleaning, feature engineering, or model evaluation—given what you now know?

- **Part B**: Write a technical report following standard conventions, for example:
  - [CMU guide to structure](https://www.stat.cmu.edu/~brian/701/notes/paper-structure.pdf)
  - [Data science report example](https://www.projectpro.io/article/data-science-project-report/620)
  - The Checklist given in this week's Blackboard Lesson (essentially the same as in HOML).
    
  Your audience here is technically literate but unfamiliar with your work, like your manager or other data scientists. Be clear and precise, and include code (for illustration), charts/plots/illustrations, and an explanation of what you discovered and your reasoning process.

The idea is that Part A serves as a repository of the most important code for further work, while Part B is the technical report that summarizes your project for the data science group at your company. Do NOT assume that readers of Part B are intimately familiar with Part A; provide code for illustration as needed, but not to run.

Submit this notebook as a group via your team leader’s Gradescope account.

---

### 2. PowerPoint Presentation [10 pts]

Create a 10–15 minute presentation designed for a general audience (e.g., sales or marketing team).

- Prepare 8–12 slides, following the general outline of the sections of Part B. 
- Focus on storytelling, visuals (plots and illustrations), and clear, simplified language. No code!
- Use any presentation tool you like, but upload a PDF version.
- List all team members on the first slide.

Submit as a group via your team leader’s Gradescope account.

---

### 3. Individual Assessment

Each team member must complete the Individual Assessment Form (same as in Milestone 1), sign it, and upload it via their own Gradescope account.

---

## Submission Checklist

-  Final Report Notebook — Team leader submission
-  PDF Slides — Team leader submission
-  Individual Assessment Form — Each member submits their own


## Part A: Final Model and Design Reassessment [10 pts]

In this part, you will finalize your best-performing model and revisit earlier decisions to determine if any should be revised in light of your complete modeling workflow. You’ll also consolidate and present the key code used to run your model on the preprocessed dataset, with thoughtful documentation of your reasoning.

**Requirements:**

- Reconsider **at least one decision from Milestone 1** (e.g., preprocessing, feature engineering, or encoding). Explain whether you would keep or revise that decision now that you know which model performs best. Justify your reasoning.
  
- Reconsider **at least one decision from Milestone 2** (e.g., model evaluation, cross-validation strategy, or feature selection). Again, explain whether you would keep or revise your original decision, and why.

- Below, include all code necessary to **run your final model** on the processed dataset. This section should be a clean, readable summary of the most important steps from Milestones 1 and 2, adapted as needed to fit your final model choice and your reconsiderations as just described. 

- Use Markdown cells and inline comments to explain the structure of the code clearly but concisely. The goal is to make your reasoning and process easy to follow for instructors and reviewers.

> Remember: You are not required to change your earlier choices, but you *are* required to reflect on them and justify your final decisions.


## Project Setup: Imports, Configs, and Helper Functions

This section loads all the libraries and settings we'll use throughout the project, grouped by purpose for clarity:

### 1. Standard Libraries  
General utilities for file handling, math, time tracking, and URL parsing (`os`, `time`, `math`, `io`, `zipfile`, `urlparse`, `itertools`), plus the third-party `requests` package for HTTP downloads.

### 2. Data Science Tools  
`pandas`, `numpy` for data wrangling; `seaborn`, `matplotlib` for visualizations.

### 3. Scikit-Learn (ML)  
Covers preprocessing, model evaluation, feature selection, and ensemble models.

### 4. Extras  
`kagglehub` for downloading datasets, `tqdm` for progress bars.

### 5. Global Config  
Sets a random seed (`random_state = 42`) for reproducibility.

### 6. Utility Functions  
- `dollar_format`: formats y-axis ticks as dollar values.  
- `format_hms`: converts seconds into `HH:MM:SS`.

In [None]:
# =============================
# Useful Imports
# =============================

# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests  # third-party HTTP client (not part of the standard library)
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score, 
    GridSearchCV, 
    RandomizedSearchCV, 
    RepeatedKFold
)
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor

# Kaggle and Progress Tracking
import kagglehub
from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

# =============================
# Utility Functions
# =============================

# Format y-axis labels as dollars with commas (optional)
def dollar_format(x, pos):
    return f'${x:,.0f}'

# Convert seconds to HH:MM:SS format
def format_hms(seconds):
    return time.strftime("%H:%M:%S", time.gmtime(seconds))
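
One caveat with `format_hms`: `time.gmtime` wraps at 24 hours, so a run lasting 25 hours would print as `01:00:00`. If long hyperparameter searches are a possibility, a variant built on `divmod` avoids the wrap. This is a minimal sketch; the name `format_hms_long` is our own and not one of the original helpers.

```python
def format_hms_long(seconds: float) -> str:
    """Format a duration as HH:MM:SS without wrapping at 24 hours."""
    total = int(seconds)                      # drop fractional seconds
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_hms_long(3725))    # 1 hour, 2 minutes, 5 seconds
print(format_hms_long(90000))   # 25 hours; time.gmtime would wrap this
```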



### Data Download & Load

We use `load_zillow_data()` to check if the dataset exists locally, download it if not, and then load it as a DataFrame. It handles errors cleanly and avoids repeated downloads. This keeps our workflow smooth and avoids accidentally hitting the server more than needed.

In [None]:
def load_zillow_data(url: str) -> pd.DataFrame:
    """
    Downloads a CSV file from the given URL if it doesn't exist locally,
    then loads it into a pandas DataFrame.

    Args:
        url (str): URL to the Zillow dataset.

    Returns:
        pd.DataFrame: Loaded dataset.
    """
    filename = os.path.basename(urlparse(url).path)

    if not os.path.exists(filename):
        try:
            print(f"Downloading {filename}...")
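            # Timeout (60 s is an arbitrary choice) so a stalled connection fails instead of hanging
            response = requests.get(url, timeout=60)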
            response.raise_for_status()
            with open(filename, "wb") as f:
                f.write(response.content)
            print("Download complete.")
        except requests.exceptions.RequestException as e:
            print(f"Failed to download the file: {e}")
            raise

    else:
        print(f"{filename} already exists. Skipping download.")

    return pd.read_csv(filename)

# Example usage
url = "https://www.cs.bu.edu/fac/snyder/cs505/Data/zillow_dataset.csv"
df = load_zillow_data(url)

### Initial Data Exploration

We use the `explore_dataframe` function to understand the structure and quality of the dataset:
- View the first few rows
- Examine column data types and null counts
- Identify unique values per column
- Assess missing data percentages
- Generate summary statistics for numeric features

These steps help define cleaning and transformation strategies in later stages.

In [None]:
def explore_dataframe(df: pd.DataFrame) -> None:
    """
    Display essential exploration metrics for a DataFrame.

    Args:
        df (pd.DataFrame): The dataframe to explore.
    """
    print("First five rows:")
    display(df.head())

    print("\nDataFrame Info:")
    df.info()

    print("\nUnique values per column:")
    display(df.nunique().sort_values(ascending=False))

    print("\nMissing values (%):")
    missing_percent = df.isnull().mean().sort_values(ascending=False) * 100
    display(missing_percent[missing_percent > 0])

    print("\nSummary statistics for numerical columns:")
    summary_stats = df.describe().T
    # Sort rows so the most variable features appear first
    summary_stats = summary_stats.sort_values('std', ascending=False)
    display(summary_stats)

### Visualizing Numeric Distributions

The `plot_numeric_histograms()` function quickly gives us a look at the distribution of all numeric features. This helps spot skew, outliers, and potential candidates for transformation.

In [None]:
def plot_numeric_histograms(df: pd.DataFrame, bins: int = 50) -> None:
    """
    Plots histograms for all numeric features in the DataFrame.

    Args:
        df (pd.DataFrame): Dataset to plot.
        bins (int): Number of bins to use in each histogram.
    """
    df.hist(figsize=(20, 15), bins=bins, layout=(-1, 5))
    plt.tight_layout()
    plt.show()

# Example usage:
plot_numeric_histograms(df)

### Identifying Low-Cardinality Categorical Features

The `find_low_cardinality_categoricals()` function pinpoints low-cardinality columns stored as `object` or `category` dtype, which are good candidates for encoding. Numeric-coded categories (such as type IDs) are not caught by this dtype filter and need a separate check.

In [None]:
def find_low_cardinality_categoricals(df: pd.DataFrame, threshold: int = 10, show: bool = True) -> list:
    """
    Identifies and optionally displays low-cardinality categorical features.

    Args:
        df (pd.DataFrame): Dataset to scan.
        threshold (int): Columns with fewer than this many unique values count as "low cardinality".
        show (bool): If True, print the feature names.

    Returns:
        list: Column names of low-cardinality categorical features.
    """
    n_unique = df.nunique()  # compute once and reuse
    low_cardinality = n_unique[n_unique < threshold].index.tolist()
    categoricals = [
        col for col in low_cardinality 
        if df[col].dtype == "object" or df[col].dtype.name == "category"
    ]
    if show:
        print("Low-cardinality categorical features:")
        print(categoricals)
    return categoricals
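
Because the function above filters on `object` and `category` dtypes, ID-style columns stored as numbers slip through. The sketch below shows one way to flag those as well. The function name and the toy data are hypothetical, and the result still needs a manual review, since genuine counts such as bedrooms can also have few distinct values.

```python
import pandas as pd

def find_numeric_coded_categoricals(df: pd.DataFrame, threshold: int = 10) -> list:
    """Return numeric columns with more than 1 but fewer than `threshold` distinct values."""
    numeric_cols = df.select_dtypes(include="number").columns
    counts = df[numeric_cols].nunique()          # NaN is ignored by default
    return counts[(counts > 1) & (counts < threshold)].index.tolist()

# Toy data, made up for illustration (not taken from the Zillow file)
toy = pd.DataFrame({
    "typeid": [1.0, 2.0, 1.0, 3.0, None, 2.0],   # numeric code with 3 distinct values
    "sqft": [900.0, 1200.0, 1500.0, 1800.0, 2100.0, 2400.0],
    "label": ["a", "b", "a", "b", "a", "b"],      # object dtype, skipped here
})
print(find_numeric_coded_categoricals(toy, threshold=5))
```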

The following variables are candidates for removal simply because so few of their values are non-null:

| Field Name                  | Non-Null Count | Total Count | Non-Null (%)   |
|-----------------------------|----------------|-------------|----------------|
| architecturalstyletypeid    | 207            | 77613       | 0.27           |
| basementsqft                | 50             | 77613       | 0.06           |
| buildingclasstypeid         | 15             | 77613       | 0.02           |
| decktypeid                  | 614            | 77613       | 0.79           |
| finishedfloor1squarefeet    | 6037           | 77613       | 7.78           |
| finishedsquarefeet13        | 42             | 77613       | 0.05           |
| finishedsquarefeet15        | 3027           | 77613       | 3.90           |
| finishedsquarefeet50        | 6037           | 77613       | 7.78           |
| finishedsquarefeet6         | 386            | 77613       | 0.50           |
| poolsizesum                 | 869            | 77613       | 1.12           |
| pooltypeid10                | 465            | 77613       | 0.60           |
| pooltypeid2                 | 1074           | 77613       | 1.38           |
| storytypeid                 | 50             | 77613       | 0.06           |
| typeconstructiontypeid      | 223            | 77613       | 0.29           |
| yardbuildingsqft17          | 2393           | 77613       | 3.08           |
| yardbuildingsqft26          | 70             | 77613       | 0.09           |
| fireplaceflag               | 172            | 77613       | 0.22           |
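
A table like the one above can be generated rather than typed by hand. The following is a minimal sketch under our own assumptions: the function name `sparse_column_report` and the 10% cutoff are illustrative choices, and the toy frame stands in for the real dataset.

```python
import pandas as pd

def sparse_column_report(df: pd.DataFrame, max_non_null_pct: float = 10.0) -> pd.DataFrame:
    """List columns whose share of non-null values is at or below the cutoff."""
    report = pd.DataFrame({
        "non_null_count": df.notnull().sum(),
        "total_count": len(df),
        "non_null_pct": (df.notnull().mean() * 100).round(2),
    })
    sparse = report[report["non_null_pct"] <= max_non_null_pct]
    return sparse.sort_values("non_null_pct")

# Toy data: one column with 1 of 20 values present, one complete column
toy = pd.DataFrame({
    "mostly_empty": [None] * 19 + [1.0],
    "complete": range(20),
})
print(sparse_column_report(toy))
```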

## Project Framing & Objectives

### 1.1 What’s the business goal?  
We’re building a regression model to predict property values using real estate data — things like square footage, location, and amenities. This supports smarter pricing and valuation tools.

### 1.2 How will it be used?  
Our model can help real estate professionals — especially marketing teams — make more data-driven decisions. Zillow (or similar platforms) could use it in automated valuation tools.

### 1.3 How do we measure success?  
We’ll track RMSE and R² on validation/test data. A strong model should generalize well, handle variation across cities/neighborhoods, and be easy to understand.
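
For readers less familiar with the two metrics, the sketch below computes them from their definitions using only the standard library. The sample values are made up; in the notebook itself, scikit-learn's metric functions would normally be used instead.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Share of target variance explained: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, in thousands of dollars
y_true = [300, 450, 520, 710]
y_pred = [320, 430, 540, 690]
print(f"RMSE: {rmse(y_true, y_pred):.1f}")
print(f"R^2:  {r_squared(y_true, y_pred):.3f}")
```

RMSE is reported in dollars, which makes it easy to explain to a business audience, while R² is unitless and easier to compare across differently scaled targets.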

### Re-checking the Dataset Structure

We revisit the basic structure, types, and missing values of the dataset. This helps confirm our earlier observations and catches any subtle changes after framing or minor cleaning steps.

In [None]:
# Re-run our EDA summary in case anything has changed
explore_dataframe(df)

### 2.A.1: Which features are categorical?

Features like `propertylandusetypeid`, `airconditioningtypeid`, and `regionidzip` are technically numeric, but function as categories. They represent types or group identifiers and should be encoded accordingly.

### 2.A.2: Any features that seem useless?

Yes — columns like `decktypeid`, `basementsqft`, and `fireplaceflag` have very few non-null entries and don't seem helpful for our valuation task. Unless we find they’re critical later, we’ll likely drop them.

### 2.A.3: Useless Features by Missingness

Features like `basementsqft`, `yardbuildingsqft17`, and `finishedsquarefeet6` are missing in roughly 97% to over 99.9% of rows. That’s too sparse for reliable modeling, so we’ll consider dropping them.

### 2.A.4: Useless Features by Uniqueness

`parcelid` is a unique identifier for each row. Since it carries no generalizable pattern, we’ll exclude it from modeling.
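
Identifier-like and constant columns can be flagged automatically before deciding what to exclude. This is a rough sketch with a hypothetical function name and toy data. Note that continuous measurements can also be unique on every row, and a sparse flag column with a single non-null value will show up as constant, so the output is a list of candidates to review rather than a drop list.

```python
import pandas as pd

def flag_id_and_constant_columns(df: pd.DataFrame) -> dict:
    """Split out columns that are unique on every row or hold a single distinct value."""
    n_unique = df.nunique(dropna=True)
    id_like = n_unique[n_unique == len(df)].index.tolist()
    constant = n_unique[n_unique == 1].index.tolist()
    return {"id_like": id_like, "constant": constant}

# Toy data, made up for illustration
toy = pd.DataFrame({
    "parcel": [101, 102, 103, 104],     # unique per row
    "year": [2016, 2016, 2016, 2016],   # constant
    "beds": [2, 3, 3, 4],
})
print(flag_id_and_constant_columns(toy))
```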

### Additional Notes

We also noticed potential redundancy in some features, like square footage and bathroom count variations. These may be dropped or consolidated later during feature selection.

### Starting 2.B, Target + Predictor Distributions

Boxplots help us spot outliers and skew in both the target (`taxvaluedollarcnt`) and predictors. These patterns influence how we scale or transform features later.


In [None]:
def plot_target_and_predictors(df: pd.DataFrame, target: str, predictors: list, show: bool = True) -> None:
    """
    Plots boxplots for the target variable and selected key predictors.

    Args:
        df (pd.DataFrame): The dataset.
        target (str): Name of the target variable.
        predictors (list): List of column names to visualize alongside the target.
        show (bool): Whether to display the plot.
    """
    total_plots = len(predictors) + 1
    rows = (total_plots + 1) // 2
    fig, axes = plt.subplots(rows, 2, figsize=(14, rows * 4))

    axes = axes.flatten()
    sns.boxplot(x=df[target], ax=axes[0])
    axes[0].set_title(f"Target: {target}")

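    for i, col in enumerate(predictors):
        sns.boxplot(x=df[col], ax=axes[i + 1])
        axes[i + 1].set_title(col)

    # Hide the leftover axis when the number of plots is odd
    for ax in axes[total_plots:]:
        ax.set_visible(False)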

    plt.tight_layout()
    if show:
        plt.show()

# Example usage:
plot_target_and_predictors(
    df,
    target="taxvaluedollarcnt",
    predictors=["calculatedfinishedsquarefeet", "bathroomcnt", "bedroomcnt"]
)

### 2.B Summary: Feature Distribution Observations

- The target `taxvaluedollarcnt` is highly skewed (right-tailed), which suggests high-value outliers may affect regression.
- `calculatedfinishedsquarefeet`, `bathroomcnt`, and `bedroomcnt` all show decent spread but possible skew or multiple modes.
- We'll consider transformation or filtering based on these insights.
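
One common transformation for a right-skewed target is `log1p`, with `expm1` used to map predictions back to dollars. The sketch below uses made-up values and the standard library only, and compares skewness before and after the transform. Whether this helps the final model is an empirical question, not something the snippet establishes.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up property values with one expensive outlier
values = [100_000, 150_000, 200_000, 250_000, 300_000, 5_000_000]
logged = [math.log1p(v) for v in values]

print(f"skew before log1p: {skewness(values):.2f}")
print(f"skew after log1p:  {skewness(logged):.2f}")

# Predictions made on the log scale map back to dollars with expm1
restored = [math.expm1(v) for v in logged]
```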

## Part 3: Clean the Data (Drop, Impute, Encode)

From here on, we begin modifying the data for modeling. At each stage, we’ll assign the result to a new variable (`df_cleaned`, then `df_null_cleaned`, and so on) to preserve version history and track changes step-by-step.

### Correlation Analysis

Here we check for redundancy among feature groups using correlation matrices and visualizations. Highly correlated features (e.g., square footage, bathroom types) will be simplified by keeping the most complete column.

In [None]:
# === Reusable Function: Group Feature Correlation with Heatmap ===
def check_feature_group_correlation(df: pd.DataFrame, features: list, group_name: str, show: bool = True) -> pd.DataFrame:
    """
    Displays a correlation matrix and heatmap for a group of features.

    Args:
        df (pd.DataFrame): The dataset.
        features (list): Features to compare.
        group_name (str): Group label for plot titles.
        show (bool): Show heatmap and plots.

    Returns:
        pd.DataFrame: Correlation matrix of selected features.
    """
    subset = df[features].dropna(how='all')
    corr = subset.corr()

    if show:
        print(f"\nCorrelation matrix for {group_name}:")
        display(corr)

        plt.figure(figsize=(10, 8))
        sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1, fmt='.4f')
        plt.title(f'Correlation Matrix: {group_name}')
        plt.tight_layout()
        plt.show()

        if len(features) == 2:
            plt.figure(figsize=(8, 6))
            sns.scatterplot(data=subset, x=features[0], y=features[1], alpha=0.5)
            plt.title(f'Scatter Plot: {features[0]} vs {features[1]}')
            plt.tight_layout()
            plt.show()

    return corr

In [None]:
# 1. Square Footage
sqft_features = ['calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'finishedsquarefeet13', 'finishedsquarefeet15', 'finishedsquarefeet50', 'finishedsquarefeet6']
sqft_corr = check_feature_group_correlation(df, sqft_features, 'Square Footage')

# 2. Bathroom Counts
bath_features = ['bathroomcnt', 'calculatedbathnbr', 'fullbathcnt', 'threequarterbathnbr']
bath_corr = check_feature_group_correlation(df, bath_features, 'Bathroom Features')

# 3. Census Features
census_features = ['rawcensustractandblock', 'censustractandblock']
census_corr = check_feature_group_correlation(df, census_features, 'Census Features')

# 4. Pool Features
pool_features = ['poolcnt', 'pooltypeid7']
pool_corr = check_feature_group_correlation(df, pool_features, 'Pool Features')

# Extra Checks on Pool Data
print(f"\nPercentage of properties with missing pool count: {df['poolcnt'].isnull().mean() * 100:.2f}%")
both_pool = (~df['poolcnt'].isnull()) & (~df['pooltypeid7'].isnull())
print(f"Properties with both pool features non-null: {both_pool.sum()}")
print(f"Non-null poolcnt: {(~df['poolcnt'].isnull()).sum()}")
print(f"Non-null pooltypeid7: {(~df['pooltypeid7'].isnull()).sum()}")

### Deep-Dive Pairwise Feature Comparisons

To avoid dropping features that might be similar but not redundant, we use this detailed comparison. It checks:
- Correlation coefficient
- Whether they’re scaled versions of each other
- Outlier examples

In [None]:
# === Reusable Function: Deep Pairwise Correlation Analysis ===
def analyze_pairwise_correlation(df: pd.DataFrame, feature1: str, feature2: str, show: bool = True) -> float:
    """
    Compares two features: correlation, scale similarity, and outlier differences.

    Returns Pearson correlation and displays plots + diagnostics.
    """
    # [PASTE full function code here... unchanged]

In [None]:
# === Detailed Pairwise Correlation Analysis ===
analyze_pairwise_correlation(df, 'calculatedfinishedsquarefeet', 'finishedsquarefeet12')
analyze_pairwise_correlation(df, 'calculatedfinishedsquarefeet', 'finishedsquarefeet13')
analyze_pairwise_correlation(df, 'calculatedfinishedsquarefeet', 'finishedsquarefeet15')
analyze_pairwise_correlation(df, 'calculatedfinishedsquarefeet', 'finishedsquarefeet6')
analyze_pairwise_correlation(df, 'calculatedbathnbr', 'bathroomcnt')
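
The body of `analyze_pairwise_correlation` is not reproduced in this notebook. As an illustration only, the sketch below shows how the three checks listed above (correlation, scaled copies, outlier examples) could be computed. It is not the original function: `compare_feature_pair`, its return keys, and the toy data are all our own assumptions.

```python
import pandas as pd

def compare_feature_pair(df: pd.DataFrame, feature1: str, feature2: str) -> dict:
    """Summarize how closely two numeric columns track each other."""
    both = df[[feature1, feature2]].dropna()           # rows where both are present
    corr = both[feature1].corr(both[feature2])         # Pearson by default
    identical_share = (both[feature1] == both[feature2]).mean()
    # A near-constant ratio suggests one column is a rescaled copy of the other
    nonzero = both[both[feature2] != 0]
    ratio = nonzero[feature1] / nonzero[feature2]
    # Rows with the largest absolute gap are the outlier examples to inspect
    gap = (both[feature1] - both[feature2]).abs()
    return {
        "n_overlap": len(both),
        "pearson": corr,
        "identical_share": identical_share,
        "ratio_std": ratio.std(),
        "largest_gaps": both.loc[gap.nlargest(3).index],
    }

# Toy data, made up for illustration
toy = pd.DataFrame({
    "sqft_a": [1000.0, 1500.0, 2000.0, 2500.0, None],
    "sqft_b": [1000.0, 1500.0, 2000.0, 2600.0, 900.0],
})
result = compare_feature_pair(toy, "sqft_a", "sqft_b")
print(result["n_overlap"], round(result["pearson"], 4), result["identical_share"])
```

Restricting the comparison to rows where both columns are present matters here, because the sparse square-footage variants overlap with `calculatedfinishedsquarefeet` on only a small subset of rows.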

### Feature Dropping Based on Redundancy

Based on correlation and completeness, we dropped features that are:
- Functionally identical to others (`bathroomcnt`, `finishedsquarefeet12`)
- Redundant geographic fields
- Unique IDs or constants

We keep the most complete and interpretable columns to reduce noise.

In [None]:
def drop_irrelevant_features(df: pd.DataFrame, columns_to_drop: list, show: bool = True) -> pd.DataFrame:
    """
    Drops specified non-predictive or redundant features from the dataset.

    Args:
        df (pd.DataFrame): Input dataframe.
        columns_to_drop (list): List of column names to remove.
        show (bool): If True, prints list and new shape.

    Returns:
        pd.DataFrame: Cleaned DataFrame.
    """
    df_reduced = df.drop(columns=columns_to_drop).copy()
    if show:
        print("Dropped irrelevant features:", columns_to_drop)
        print("New shape:", df_reduced.shape)
    return df_reduced

In [None]:
# === Drop irrelevant or redundant features after correlation analysis ===
drop_irrelevant = [
    "parcelid",                # Unique ID – not predictive
    "rawcensustractandblock",  # High-cardinality
    "censustractandblock",     # Redundant with above
    "regionidzip",             # Geographic ID
    "assessmentyear",          # Constant
    "finishedsquarefeet12",    # Duplicate of calculatedfinishedsquarefeet
    "finishedsquarefeet13",    # Same reason as finishedsquarefeet12
    "finishedsquarefeet15",    # Same reason as finishedsquarefeet12
    "finishedsquarefeet6",     # Same reason as finishedsquarefeet12
    "bathroomcnt",             # Duplicate of calculatedbathnbr
]

df_cleaned = drop_irrelevant_features(df, drop_irrelevant)

### 3.A: Justifying Feature Drops

We dropped features that either:
- Had no predictive value (`parcelid`)
- Were constant (`assessmentyear`)
- Were highly correlated and less complete (`bathroomcnt`, `finishedsquarefeet12`, etc.)

Specifically:
- `calculatedfinishedsquarefeet` was kept over similar features due to higher completeness (99.7% vs 95.2%)
- `calculatedbathnbr` was kept for similar reasons

### 3.B: Drop Features with Excessive Missingness

Now we identify features with high missing values that also lack strong correlation to the target. These are unlikely to help the model and may introduce noise.

### Identifying Weak Features: High Nulls + Low Correlation

This function highlights features that are both:
- Highly incomplete (over 90% missing)
- Weakly correlated to the target (abs(corr) < 0.1)

These features likely won't help the model and may just add noise, so we’ll consider dropping them next.

In [None]:
def identify_high_null_low_corr_features(df: pd.DataFrame, target: str, null_thresh: float = 0.90, corr_thresh: float = 0.1, show: bool = True) -> pd.DataFrame:
    """
    Identifies features with high null rates and low correlation with the target.

    Args:
        df (pd.DataFrame): The input dataset.
        target (str): The target column to check correlation against.
        null_thresh (float): Proportion of missing values allowed (e.g., 0.90).
        corr_thresh (float): Absolute correlation threshold below which features are flagged.
        show (bool): If True, displays summary DataFrame.

    Returns:
        pd.DataFrame: A summary of high-null, low-corr columns.
    """
    high_null_cols = df.columns[df.isnull().mean() > null_thresh]
    correlations = df.corr(numeric_only=True)[target]
    low_corr_cols = correlations[correlations.abs() < corr_thresh].index
    drop_candidates = list(set(high_null_cols) & set(low_corr_cols))

    summary_data = []
    for col in drop_candidates:
        preview = df[col].dropna().unique()[:3]
        summary_data.append({
            "Feature": col,
            "Data Preview": preview,
            "Correlation": correlations.get(col),
            "Null Count": df[col].isnull().sum(),
            "Non-Null Count": df[col].notnull().sum()
        })

    drop_summary_df = pd.DataFrame(summary_data).sort_values(
        by="Correlation", key=lambda x: x.abs(), ascending=True
    ).reset_index(drop=True)

    if show:
        print(f"\nHigh-null, low-correlation candidates (null > {null_thresh*100:.0f}%, corr < {corr_thresh}):")
        display(drop_summary_df)

    return drop_summary_df

In [None]:
# Step 3.B – Use function to identify weak features
drop_summary_df = identify_high_null_low_corr_features(df_cleaned, target="taxvaluedollarcnt")

### Dropping Weak Features

Now that we’ve identified columns with both high missingness and little correlation to the target,
we drop them from the dataset. This helps reduce noise and dimensionality before encoding or modeling.

In [None]:
def drop_high_null_low_corr(
    df: pd.DataFrame, 
    target: str, 
    null_thresh: float = 0.90, 
    corr_thresh: float = 0.10, 
    show: bool = True
) -> tuple[pd.DataFrame, list]:
    """
    Drops features that have a high proportion of missing values AND are weakly correlated with the target.

    Args:
        df (pd.DataFrame): Dataset to process.
        target (str): Name of the target column.
        null_thresh (float): Null value threshold (default: 90%).
        corr_thresh (float): Correlation threshold (default: 0.10).
        show (bool): If True, print dropped features and reason.

    Returns:
        tuple: (cleaned DataFrame, list of dropped column names)
    """
    high_null = df.columns[df.isnull().mean() > null_thresh]
    numeric_df = df.select_dtypes(include='number')
    correlations = numeric_df.corr()[target].fillna(0)
    low_corr = correlations[correlations.abs() < corr_thresh].index

    to_drop = list(set(high_null) & set(low_corr))

    if show:
        print(f"\nDropping {len(to_drop)} features with >{null_thresh*100:.0f}% missing and low correlation:")
        for col in to_drop:
            print(f" - {col}: {df[col].isnull().mean()*100:.1f}% null, corr = {correlations[col]:.3f}")

    df_cleaned = df.drop(columns=to_drop).copy()
    return df_cleaned, to_drop

In [None]:
# Step 3.B – Drop the features we identified
df_null_cleaned, dropped_features = drop_high_null_low_corr(df_cleaned, target="taxvaluedollarcnt")

### Applying High-Null + Low-Correlation Drop

We now run our drop function on `df_cleaned` to remove features that were:
- More than 90% missing (the function's default `null_thresh`)
- Weakly correlated with the target (|corr| < 0.1, the default `corr_thresh`)

This produces a new version of the dataset: `df_null_cleaned`, which is smaller and cleaner moving into the next steps.

In [None]:
def plot_target_correlations(df, target='taxvaluedollarcnt', top_n=10, method='pearson', figsize=(12, 10)):
    """
    Plot correlations between the target variable and its most correlated numeric features.

    Parameters:
    -----------
    df : pandas DataFrame
        The input DataFrame (should be cleaned beforehand)
    target : str, default='taxvaluedollarcnt'
        The target variable to correlate against
    top_n : int, default=10
        Number of top features to include (excluding the target itself)
    method : str, default='pearson'
        Correlation method: 'pearson', 'kendall', or 'spearman'
    figsize : tuple
        Size of the output figure
    """
    # Select numeric features
    numeric_df = df.select_dtypes(include=[np.number])
    
    # Compute correlation with the target, excluding the target itself
    target_corr = numeric_df.corrwith(df[target], method=method).drop(index=target)
    
    # Rank by absolute strength so strong negative correlations are not overlooked
    top_features = target_corr.abs().sort_values(ascending=False).head(top_n).index.tolist()
    
    # Build subset DataFrame with target + top features
    subset_cols = [target] + top_features
    corr_subset = numeric_df[subset_cols].corr(method=method)
    
    # Plot heatmap
    plt.figure(figsize=figsize)
    sns.heatmap(
        corr_subset,
        annot=True,
        fmt=".2f",
        cmap='coolwarm',
        vmax=1.0,
        vmin=-1.0,
        center=0,
        square=True,
        linewidths=0.5
    )
    plt.title(f"Top {top_n} {method.capitalize()} Correlated Features with '{target}'", fontsize=16)
    plt.tight_layout()
    plt.show()
    
    return top_features

# ---- Usage with your own DataFrame ----
top_features = plot_target_correlations(df_cleaned, top_n=10, method='pearson')


## Part B: Final Data Science Project Report Assignment [30 pts]

This final report is the culmination of your semester-long Data Science project, building upon the exploratory analyses and modeling milestones you've already completed. Your report should clearly communicate your findings, analysis approach, and conclusions to a technical audience. The following structure and guidelines, informed by best practices, will help you prepare a professional and comprehensive document.

### Required Sections

Your report must include the following sections:


#### 1. Executive Summary (Abstract) [2 pts]
- Brief overview of the entire project (150–200 words)
- Clearly state the objective, approach, and key findings

#### 2. Introduction [2 pts]
- Clearly introduce the topic and context of your project
- Describe the problem you are addressing (the problem statement)
- Clearly state the objectives and goals of your analysis

Note: You may imaginatively consider this project as taking place in a real estate company with a small data science group in-house, and write your introduction from this point of view (don't worry about verisimilitude to an actual company!).  

#### 3. Data Description [2 pts]
- Describe the source of your dataset (described in Milestone 1)
- Clearly state the characteristics of your data (size, types of features, missing values, target, etc.)

#### 4. Methodology (What you did, and why)  [12 pts]

**Focus this section entirely on the steps you took and your reasoning behind them. Emphasize the process and decision-making, not the results themselves.**

- Describe your analytical framework 
  - Use of validation curves to see the effect of various hyperparameter choices, and
  - Choice of RMSE as primary error metric
- Clearly outline your data cleaning and preprocessing steps
  - Describe what issues you encountered in the raw data and how you addressed them.
  - Mention any key decisions (e.g., removing samples with too many missing values).
  - What worked and what didn't work?
- Describe your feature engineering approach
  - Explain any transformations, combinations, or derived features.
  - Discuss why certain features were chosen or created, even if they were later discarded.
  - What worked and what didn't work?
- Detail your model selection process 
  - Outline the models you experimented with and why.
  - Discuss how you evaluated generalization (e.g., cross-validation, shape and relationships of plots).
  - Mention how you tuned hyperparameters or selected the final model.
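
As an illustration of the validation-curve idea mentioned above, the sketch below varies one hyperparameter and reports cross-validated RMSE for each value. It runs on synthetic data from `make_regression`, and the model, the `max_depth` grid, and the fold count are arbitrary choices for the example rather than recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the processed dataset (not the Zillow data)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

depths = [2, 4, 8]
train_scores, val_scores = validation_curve(
    RandomForestRegressor(n_estimators=20, random_state=42),
    X, y,
    param_name="max_depth",
    param_range=depths,
    cv=3,
    scoring="neg_root_mean_squared_error",
)

# Scores are negated RMSE, so flip the sign before reporting
train_rmse = -train_scores.mean(axis=1)
val_rmse = -val_scores.mean(axis=1)
for depth, tr, va in zip(depths, train_rmse, val_rmse):
    print(f"max_depth={depth}: train RMSE={tr:.1f}, validation RMSE={va:.1f}")
```

A widening gap between the training and validation curves as the parameter grows is the usual sign of overfitting, and it is the kind of evidence the Methodology section asks you to discuss.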



#### 5. Results and Evaluation (What you found, and how well it worked) [10 pts]

**Focus purely on outcomes, with metrics, visuals, and insights. This is where you present evidence to support your conclusions.**

- Provide a clear and detailed narrative of your analysis and reasoning using the analytical approach described in Section 4.
- Discuss model performance metrics and results (RMSE, R², etc.)
- **Include relevant visualizations (graphs, charts, tables) with appropriate labels and captions**
- Error analysis
  - Highlight specific patterns of error, outliers, or questionable features.
  - Note anything surprising or worth improving in future iterations.
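
One simple way to look for patterns of error is to break RMSE down by the price band of the true value. The sketch below is a hypothetical helper using only the standard library; the band edges and sample values are made up for illustration.

```python
import math
from collections import defaultdict

def rmse_by_band(y_true, y_pred, band_edges):
    """Group squared errors by the band of the true value and return RMSE per band."""
    squared = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        # Index of the first edge the true value falls under; the last band is open-ended
        band = next((i for i, edge in enumerate(band_edges) if t < edge), len(band_edges))
        squared[band].append((t - p) ** 2)
    return {band: math.sqrt(sum(errs) / len(errs)) for band, errs in sorted(squared.items())}

# Made-up true and predicted values, in thousands of dollars
y_true = [200, 250, 400, 450, 1500, 2000]
y_pred = [210, 240, 430, 420, 1200, 2400]
print(rmse_by_band(y_true, y_pred, band_edges=[300, 1000]))
```

If the error in the top band dominates, as it does in this toy example, that points toward outlier handling or a target transformation as a future improvement.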


#### 6. Conclusion [2 pts]
- Clearly state your main findings and how they address your original objectives
- Highlight the business or practical implications of your findings 
- Discuss the limitations and constraints of your analysis clearly and transparently
- Suggest potential improvements or future directions