# Project Final Report

### Due: Midnight on April 27 (2-hour grace period) — 50 points  

### No late submissions will be accepted.


## Overview

Your final submission consists of **three components**:

---

### 1. Final Report Notebook [40 pts]

Complete all sections of this notebook to document your final decisions, results, and broader context.

- **Part A**: Select the single best model from your Milestone 2 experiments. Now that you’ve finalized your model, revisit your decisions from Milestones 1 and 2. Are there any steps you would change—such as cleaning, feature engineering, or model evaluation—given what you now know?

- **Part B**: Write a technical report following standard conventions, for example:
  - [CMU guide to structure](https://www.stat.cmu.edu/~brian/701/notes/paper-structure.pdf)
  - [Data science report example](https://www.projectpro.io/article/data-science-project-report/620)
  - The Checklist given in this week's Blackboard Lesson (essentially the same as in HOML).
    
  Your audience here is technically literate but unfamiliar with your work, like your manager or other data scientists. Be clear and precise, and include code (for illustration), charts, plots, or other illustrations, and an explanation of what you discovered and your reasoning process.

The idea is that Part A serves as a repository of the most important code for future work, while Part B is the technical report that summarizes your project for the data science group at your company. Do NOT assume that readers of Part B are intimately familiar with Part A; provide code for illustration as needed, but not to run.

Submit this notebook as a group via your team leader’s Gradescope account.

---

### 2. PowerPoint Presentation [10 pts]

Create a 10–15 minute presentation designed for a general audience (e.g., sales or marketing team).

- Prepare 8–12 slides, following the general outline of the sections of Part B. 
- Focus on storytelling, visuals (plots and illustrations), and clear, simplified language. No code!
- Use any presentation tool you like, but upload a PDF version.
- List all team members on the first slide.

Submit as a group via your team leader’s Gradescope account.

---

### 3. Individual Assessment

Each team member must complete the Individual Assessment Form (same as in Milestone 1), sign it, and upload it via their own Gradescope account.

---

## Submission Checklist

- Final Report Notebook: Team leader submission
- PDF Slides: Team leader submission
- Individual Assessment Form: Each member submits their own


## Part A: Final Model and Design Reassessment [10 pts]

In this part, you will finalize your best-performing model and revisit earlier decisions to determine if any should be revised in light of your complete modeling workflow. You’ll also consolidate and present the key code used to run your model on the preprocessed dataset, with thoughtful documentation of your reasoning.

**Requirements:**

- Reconsider **at least one decision from Milestone 1** (e.g., preprocessing, feature engineering, or encoding). Explain whether you would keep or revise that decision now that you know which model performs best. Justify your reasoning.
  
- Reconsider **at least one decision from Milestone 2** (e.g., model evaluation, cross-validation strategy, or feature selection). Again, explain whether you would keep or revise your original decision, and why.

- Below, include all code necessary to **run your final model** on the processed dataset. This section should be a clean, readable summary of the most important steps from Milestones 1 and 2, adapted as needed to fit your final model choice and your reconsiderations as just described. 

- Use Markdown cells and inline comments to explain the structure of the code clearly but concisely. The goal is to make your reasoning and process easy to follow for instructors and reviewers.

> Remember: You are not required to change your earlier choices, but you *are* required to reflect on them and justify your final decisions.


## Import Necessary Libraries

In [None]:
# =============================
# Useful Imports
# =============================

# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV,
    RepeatedKFold
)
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import root_mean_squared_error
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
# Kaggle and Progress Tracking
import kagglehub
from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

# =============================
# Utility Functions
# =============================

# Format y-axis labels as dollars with commas (optional)
def dollar_format(x, pos):
    return f'${x:,.0f}'

# Convert seconds to HH:MM:SS format
def format_hms(seconds):
    return time.strftime("%H:%M:%S", time.gmtime(seconds))

## Load Data Function

In [None]:
def load_data(path, target_column=None, random_state=random_state):
    # Load the dataset
    df = pd.read_csv(path)

    # check if the dataset has been loaded properly
    if df.empty:
        raise ValueError("the dataset is empty, please check the path or the file content.")
    print(f"loaded dataset with shape: {df.shape}")

    # Split the data into features and target variables
    X = df.drop(columns=target_column)
    y = df[target_column]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

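    # create scaled versions of the train and test features
    # (fit the scaler on the training set only, then reuse its statistics on the test set to avoid leakage)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)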

    # Convert scaled data back to DataFrame for better readability
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

    return X_train, X_test, y_train, y_test, X_train_scaled, X_test_scaled, df
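
A minimal sketch, on made-up numbers, of why a scaler should learn its mean and standard deviation from the training split only and then be applied unchanged to the test split. The toy arrays and variable names are assumptions for illustration and are not drawn from the Zillow data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): the test split has a very different range.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[10.0], [20.0]])

scaler = StandardScaler()
X_train_toy_scaled = scaler.fit_transform(X_train_toy)  # learn mean/std from train
X_test_toy_scaled = scaler.transform(X_test_toy)        # reuse the train statistics

# Refitting on the test split instead centres the test data on its own mean,
# hiding the fact that it lies far outside the training range.
X_test_refit = StandardScaler().fit_transform(X_test_toy)

print(X_test_toy_scaled.ravel())
print(X_test_refit.ravel())
```

With `transform`, the test values keep their position relative to the training distribution; with a second `fit_transform`, they are rescaled as if they were typical, which is a form of leakage.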

In [None]:
# Load the cleaned data from GitHub, split it, and create scaled copies
X_train, X_test, y_train, y_test, X_train_scaled, X_test_scaled, df = load_data(
    'https://raw.githubusercontent.com/ys1433/Module-3-Assignments/refs/heads/main/zillow_cleaned.csv',
    target_column='taxvaluedollarcnt'
)

## Run and Evaluate Model Functions

In [None]:
def run_model(model, x_train, y_train, return_model=False, n_repeats=5, n_jobs=-1, random_state=random_state, **model_params):

    # Instantiate the model if a class is provided, so for example can use either BaggingRegressor or BaggingRegressor() as argument. 
    if isinstance(model, type):
        model = model(**model_params)

    neg_rmse_scores = cross_val_score(model, x_train, y_train, scoring = 'neg_root_mean_squared_error',
                                     cv = RepeatedKFold(n_splits=5, n_repeats=n_repeats, random_state=random_state), n_jobs = n_jobs)
    
    mean_cv_rmse = -np.mean(neg_rmse_scores)
    std_cv_rmse  = np.std(neg_rmse_scores)
    
    # Fit the model on the full training set
    model.fit(x_train, y_train)
    
    # Compute training RMSE
    train_preds = model.predict(x_train)
    train_rmse = root_mean_squared_error(y_train, train_preds)

    # Optionally return the fitted model alongside the metrics
    if return_model:
        return mean_cv_rmse, std_cv_rmse, train_rmse, model

    return mean_cv_rmse, std_cv_rmse, train_rmse
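
To illustrate the kind of numbers `run_model` summarizes, here is a small sketch on synthetic data. The dataset, the linear model, and the names `X_demo`, `y_demo`, and `cv` are assumptions chosen to keep the example fast; they are not part of the project pipeline.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression problem (hypothetical stand-in for the real features)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 5 folds repeated 2 times gives 10 scores in total
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
neg_rmse = cross_val_score(LinearRegression(), X_demo, y_demo,
                           scoring='neg_root_mean_squared_error', cv=cv)

# scikit-learn maximizes scores, so RMSE is reported negated; flip the sign back.
mean_cv_rmse = -neg_rmse.mean()
std_cv_rmse = neg_rmse.std()  # the spread is unaffected by the sign flip

print(len(neg_rmse), mean_cv_rmse, std_cv_rmse)
```

Repeating the K-fold split with different shuffles reduces the dependence of the mean RMSE on any single partition, at the cost of proportionally more fits.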

In [None]:
def evaluate_model(
    models, 
    x_train, y_train,
    return_model=False,
    random_state=random_state, 
    **model_params
):

    results = {}
    fitted_models = {}

    for name, model in models.items():
        print(f"Evaluating {name}…")

        if return_model:
            # with return_model=True, run_model also returns the fitted model
            mean_cv_rmse, std_cv_rmse, train_rmse, fitted_model = run_model(
                model,
                x_train, y_train,
                return_model=True,
                random_state=random_state,
                **model_params
            )
            fitted_models[name] = fitted_model
        else:
            # only get the three training‐set metrics
            mean_cv_rmse, std_cv_rmse, train_rmse = run_model(
                model,
                x_train, y_train,
                return_model=False,
                random_state=random_state,
                **model_params
            )
        # keep the metrics numeric for now so the table can be sorted correctly
        results[name] = {
            'Mean CV RMSE': mean_cv_rmse,
            'STD CV RMSE': std_cv_rmse,
            'Train RMSE': train_rmse,
        }

    # sort on the numeric values first, then format as dollars for display
    # (sorting the formatted strings would order them lexicographically)
    df = pd.DataFrame(results).T.sort_values(by='Mean CV RMSE')
    for col in df.columns:
        df[col] = df[col].map(lambda v: dollar_format(v, None))
    df.index.name = 'Model'

    if return_model:
        return df, fitted_models
    else:
        return df
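
One thing to watch when ranking models by a column of dollar-formatted RMSE values: strings sort character by character, not by magnitude. The short sketch below uses made-up RMSE values (an assumption for illustration) to show the difference.

```python
# Hypothetical RMSE values, as numbers and as dollar-formatted strings
numeric = [95_000, 120_000, 101_500]
formatted = [f'${x:,.0f}' for x in numeric]

# Sorting the strings compares characters left to right, so '$1...' sorts before '$9...'
sorted_as_text = sorted(formatted)

# Sorting the numbers first, then formatting, gives the intended ranking
sorted_by_value = [f'${x:,.0f}' for x in sorted(numeric)]

print(sorted_as_text)
print(sorted_by_value)
```

The two orderings differ whenever the values have different numbers of digits, so it is safer to sort on numeric columns and format only for display.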


## Data Transformation

In [None]:
# square footage features have large values and are highly skewed, consider log transformation

# basementsqft, fireplacecnt, garagecarcnt, and poolcnt have a lot of zeros (> 75%); consider transforming them to binary features
zero_list = [
    'basementsqft',
    'fireplacecnt',
    'garagecarcnt',
]
# Check the percentage of zeros in the zero_list features
for col in zero_list:
    zero_percentage = (df[col] == 0).mean() * 100
    print(f"{col}: {zero_percentage:.2f}% zeros")

In [None]:
# log transformation of square footage features

# define log transformation function
def log_transformation(df, column):
  df[f'{column}_log_transformed'] = np.log(df[column])
  fig, axes = plt.subplots(1, 2, figsize=(8,2))
  axes[0].boxplot(df[column])
  axes[0].set_title(column)
  axes[1].boxplot(df[f'{column}_log_transformed'])
  axes[1].set_title(f'{column}_log')

# apply log transformation to selected columns with large values and high skewness
# (all square footage columns except for basementsqft as it has only 38 non-zero rows)
X_train_log = X_train.copy()
log_list = ['calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'lotsizesquarefeet']

for f in log_list:
    log_transformation(X_train_log, f)

X_train_log.shape
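
A caveat worth checking before relying on `np.log`: it maps zero to negative infinity. If any of the selected columns could contain zeros, `np.log1p` is a common alternative. The sketch below uses hypothetical square-footage values (not taken from the dataset) to show the difference.

```python
import numpy as np

# Hypothetical square-footage values, including a zero
sqft = np.array([0.0, 850.0, 1200.0, 2400.0, 15000.0])

with np.errstate(divide='ignore'):   # silence the divide-by-zero warning for the demo
    plain_log = np.log(sqft)         # the zero entry becomes -inf
safe_log = np.log1p(sqft)            # log(1 + x): zero maps to zero

print(np.isinf(plain_log).sum())
print(safe_log.round(3))
```

`np.expm1` inverts `np.log1p`, so the transformed values can be mapped back to the original scale if needed.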

In [None]:
# polynomial transformation

# define polynomial transformation function
def polynomial_transformation(df, column, degree):
  df[f'{column}_poly_transformed'] = df[column]**degree
  fig, axes = plt.subplots(1, 2, figsize=(8,2))
  axes[0].boxplot(df[column])
  axes[0].set_title(column)
  axes[1].boxplot(df[f'{column}_poly_transformed'])
  axes[1].set_title(f'{column}_poly')

# apply polynomial transformation to selected columns
X_train_poly = X_train_log.copy()
poly_list = ['calculatedbathnbr']
for f in poly_list:
  polynomial_transformation(X_train_poly, f, 2)

In [None]:
# encoder for Categorical features

# define one-hot encoding function
def one_hot_encode(df, columns):
    df = df.copy()
    df[columns] = df[columns].astype(str)

    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    encoded = encoder.fit_transform(df[columns])
    encoded_df = pd.DataFrame(
        encoded,
        columns=encoder.get_feature_names_out(columns),
        index=df.index
    )

    df = pd.concat([df, encoded_df], axis=1)
    return df

# apply one-hot encoding to selected categorical features (id type features without too many unique values)
encode_list = ['regionidcounty', 'heatingorsystemtypeid', 'propertylandusetypeid' ]
X_train_encoded = one_hot_encode(X_train_poly, encode_list)
X_train_encoded.shape

In [None]:
# convert zero_list features to binary features
def convert_to_binary(df, column):
    # 1 if the feature is present (non-zero), 0 otherwise
    df[f'{column}_binary'] = (df[column] != 0).astype(int)

X_train_binary = X_train_encoded.copy()
print(zero_list)
for f in zero_list:
    convert_to_binary(X_train_binary, f)

X_train_binary.shape
print(X_train_binary.info())

In [None]:
# Standardize the transformed features
scaler = StandardScaler()
X_train_transformed_scaled = scaler.fit_transform(X_train_binary)

X_train_transformed_scaled = pd.DataFrame(X_train_transformed_scaled, columns = X_train_binary.columns)

print(X_train_transformed_scaled.shape)
print(y_train.shape)

In [None]:
# create a list of all transformed features
transformed_features = log_list + poly_list + encode_list + zero_list
print(len(transformed_features), transformed_features)

# create a df without the original features
X_train_transformed_scaled_original_dropped = X_train_transformed_scaled.drop(columns=transformed_features)
X_train_transformed_scaled_original_dropped.shape

## Gradient Boosting

In [None]:
def sweep_parameter(model,
                    Parameters,
                    param,
                    parameter_list,
                    x_train          = X_train_scaled,
                    y_train          = y_train,
                    verbose          = True,
                    show_rmse        = True,
                    n_iter_no_change = None,
                    delta            = 0.001):
    
    start = time.time()
    Parameters = Parameters.copy()  # Avoid modifying the original dictionary
    
    cv_rmses, std_cvs, train_rmses = [], [], []
    no_improve_count = 0
    best_rmse = float('inf')
    
    # Run over each value in parameter_list
    for p in tqdm(parameter_list, desc=f"Sweeping {param}"):
        Parameters[param] = p
        P_temp = Parameters.copy()
        # Remove RMSE_found if present; it is bookkeeping, not a model parameter
        P_temp.pop('RMSE_found', None)
        
        cv_rmse, std_cv, train_rmse = run_model(
            model, x_train, y_train, **P_temp
        )
        
        cv_rmses.append(cv_rmse)
        std_cvs.append(std_cv)
        train_rmses.append(train_rmse)
        
        # Early-stopping logic
        if cv_rmse < best_rmse - delta:
            best_rmse = cv_rmse
            no_improve_count = 0
        else:
            no_improve_count += 1
        
        if n_iter_no_change is not None and no_improve_count >= n_iter_no_change:
            print(f"Early stopping: No improvement after {n_iter_no_change} iterations.")
            break
    
    # Identify best parameter
    min_cv_rmse = min(cv_rmses)
    min_index = cv_rmses.index(min_cv_rmse)
    best_param = parameter_list[min_index]
    Parameters[param] = best_param
    Parameters['RMSE_found'] = min_cv_rmse
    
    if verbose:
        # Prepare for plotting
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
        
        # We only need as many parameter values as we actually computed
        partial_param_list = parameter_list[:len(cv_rmses)]
        
        # Check if our parameter list is Boolean so we can label accordingly
        is_boolean = all(isinstance(val, bool) for val in partial_param_list)
        if is_boolean:
            # Convert booleans to integer indices for plotting
            x_vals = list(range(len(partial_param_list)))
            x_labels = [str(val) for val in partial_param_list]
        else:
            # Treat numeric or other types as-is
            x_vals = partial_param_list
            x_labels = partial_param_list
        
        error_name = 'RMSE' if show_rmse else 'MSE'
        
        # ----- First plot: (R)MSE -----
        ax1.set_title(f"{error_name} vs {param}")
        
        # Apply dollar formatting ONLY if we're showing RMSE
        if show_rmse:
            ax1.yaxis.set_major_formatter(mticker.FuncFormatter(dollar_format))
        
        # Plot lines
        ax1.plot(x_vals,
                 cv_rmses,
                 marker='.', label=f"CV {error_name}", color='blue')
        ax1.plot(x_vals,
                 train_rmses,
                 marker='.', label=f"Train {error_name}", color='green')
        ax1.scatter([x_vals[min_index]],
                    [min_cv_rmse],
                    marker='x', label=f"Best CV {error_name}", color='red')
        
        ax1.set_ylabel(error_name)
        ax1.legend()
        ax1.grid()
        
        # ----- Second plot: CV Std Dev -----
        ax2.set_title(f"CV Standard Deviation vs {param}")
        ax2.plot(x_vals, std_cvs, marker='.', label=f"CV {error_name} Std", color='blue')
        ax2.set_xlabel(param)
        ax2.set_ylabel("Standard Deviation")
        ax2.legend()
        ax2.grid(alpha=0.5)
        
        # If we are using boolean x-values, set custom ticks
        if is_boolean:
            ax2.set_xticks(x_vals)
            ax2.set_xticklabels(x_labels)
        
        plt.tight_layout()
        plt.show()
        
        end = time.time()
        print("Execution Time:", time.strftime("%H:%M:%S", time.gmtime(end - start)))
    
    return Parameters
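
The sweep above tunes one hyperparameter at a time while holding the others fixed. A stripped-down sketch of that idea follows, assuming synthetic data, a plain `KFold` splitter, and deliberately tiny search ranges so that it runs quickly; none of these values are the ones used in the project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical stand-in for the scaled training set)
X_demo, y_demo = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

params = {'n_estimators': 20, 'max_depth': 2}               # starting point
sweeps = [('max_depth', [1, 2, 3]), ('n_estimators', [10, 20, 40])]

cv = KFold(n_splits=3, shuffle=True, random_state=0)
for name, values in sweeps:
    scores = []
    for v in values:
        trial = {**params, name: v}                         # change one parameter only
        model = GradientBoostingRegressor(random_state=0, **trial)
        neg_rmse = cross_val_score(model, X_demo, y_demo,
                                   scoring='neg_root_mean_squared_error', cv=cv)
        scores.append(-neg_rmse.mean())
    params[name] = values[int(np.argmin(scores))]           # keep the best value, move on

print(params)
```

This greedy, one-at-a-time search is cheap but order-dependent: it does not explore interactions between parameters, so a joint search such as `GridSearchCV` or `RandomizedSearchCV` may find a different optimum.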


In [None]:
Default_Parameters_GradientBoosting = {
    'learning_rate'           : 0.1,             # Shrinks the contribution of each tree. Affects the speed of learning and overfitting.
    'n_estimators'            : 100,             # The number of boosting stages to be run. More estimators can improve performance but increase training time.
    'max_depth'               : 3,               # Maximum depth of individual trees. Controls model complexity.
    'max_features'            : None,            # Number of features to consider when looking for best split. Can help reduce overfitting.
    'random_state'            : 42,              # Controls randomness of boosting. Useful for reproducibility.
    'RMSE_found'              : float('inf')     # NOT a parameter, but will record the RMSE found for the current parameter choices
}

In [None]:
Parameters_GB = Default_Parameters_GradientBoosting.copy()
Parameters_GB_list = []

In [None]:
parameters_and_ranges_gb = [
    ('learning_rate', np.linspace(0.1, 0.3, 10)),
    ('n_estimators', range(100, 200, 10)),
    ('max_depth', range(3, 12, 1)),
    ('max_features', np.linspace(0.1, 1.0, 10)) 
]

for param, parameter_list in parameters_and_ranges_gb:
    Parameters_GB = sweep_parameter(GradientBoostingRegressor,
                                Parameters_GB,
                                param,
                                parameter_list,
                                x_train  = X_train_scaled,
                                y_train  = y_train
                                )

    print(f"\nParameter {param} = {Parameters_GB[param]}; RMSE = {dollar_format(Parameters_GB['RMSE_found'], 2)}\n")
    print(Parameters_GB_list)

Parameters_GB_list.append(Parameters_GB)
pd.DataFrame(Parameters_GB_list)

In [None]:
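# Grab the parameters for the best gradient boosting model
# (drop 'RMSE_found': it is bookkeeping, not a GradientBoostingRegressor argument)
best_gb_params = {k: v for k, v in Parameters_GB_list[0].items() if k != 'RMSE_found'}

# Evaluate the best gradient boosting model
# (pass the class, not an instance, so run_model instantiates it with the tuned parameters)
gb_evaluation = evaluate_model({"Gradient Boosting": GradientBoostingRegressor},
                               X_train_transformed_scaled_original_dropped, y_train, **best_gb_params)
gb_evaluation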
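
A note on how a parameter dictionary travels through a call like `evaluate_model(..., **params)`: because `evaluate_model` and `run_model` both declare a `random_state` argument, a `'random_state'` key in the dictionary binds to that named argument (which seeds the cross-validation splitter) rather than being collected into `**model_params`. The sketch below uses a hypothetical stand-in function, `outer`, with the same signature shape, and made-up parameter values.

```python
def outer(model_name, random_state=0, **model_params):
    """Hypothetical stand-in mirroring the signature shape of evaluate_model."""
    return random_state, model_params

# Made-up tuned parameters, including the bookkeeping key
best = {'learning_rate': 0.1, 'max_depth': 3, 'random_state': 42, 'RMSE_found': 123.0}

# Drop the bookkeeping entry before forwarding
clean = {k: v for k, v in best.items() if k != 'RMSE_found'}

seed, forwarded = outer('gb', **clean)
print(seed)        # the named argument captured 'random_state'
print(forwarded)   # only the remaining keys reach **model_params
```

If the estimator itself should be seeded, one option is to construct it explicitly with `random_state` set, rather than relying on the dictionary to carry it through.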

## Part B: Final Data Science Project Report Assignment [30 pts]

This final report is the culmination of your semester-long Data Science project, building upon the exploratory analyses and modeling milestones you've already completed. Your report should clearly communicate your findings, analysis approach, and conclusions to a technical audience. The following structure and guidelines, informed by best practices, will help you prepare a professional and comprehensive document.

### Required Sections

Your report must include the following sections:


#### 1. Executive Summary (Abstract) [2 pts]
- Brief overview of the entire project (150–200 words)
- Clearly state the objective, approach, and key findings

#### 2. Introduction [2 pts]
- Clearly introduce the topic and context of your project
- Describe the problem you are addressing (the problem statement)
- Clearly state the objectives and goals of your analysis

Note: You may imagine that this project takes place at a real estate company with a small in-house data science group, and write your introduction from that point of view (don't worry about verisimilitude to an actual company!).

#### 3. Data Description [2 pts]
- Describe the source of your dataset (described in Milestone 1)
- Clearly state the characteristics of your data (size, types of features, missing values, target, etc.)

#### 4. Methodology (What you did, and why)  [12 pts]

**Focus this section entirely on the steps you took and your reasoning behind them. Emphasize the process and decision-making, not the results themselves.**

- Describe your analytical framework 
  - Use of validation curves to see the effect of various hyperparameter choices, and
  - Choice of RMSE as primary error metric
- Clearly outline your data cleaning and preprocessing steps
  - Describe what issues you encountered in the raw data and how you addressed them.
  - Mention any key decisions (e.g., removing samples with too many missing values).
  - What worked and what didn't work?
- Describe your feature engineering approach
  - Explain any transformations, combinations, or derived features.
  - Discuss why certain features were chosen or created, even if they were later discarded.
  - What worked and what didn't work?
- Detail your model selection process 
  - Outline the models you experimented with and why.
  - Discuss how you evaluated generalization (e.g., cross-validation, shape and relationships of plots).
  - Mention how you tuned hyperparameters or selected the final model.



#### 5. Results and Evaluation (What you found, and how well it worked) [10 pts]

**Focus purely on outcomes, with metrics, visuals, and insights. This is where you present evidence to support your conclusions.**

- Provide a clear and detailed narrative of your analysis and reasoning using the analytical approach described in Section 4.
- Discuss model performance metrics and results (RMSE, R², etc.)
- **Include relevant visualizations (graphs, charts, tables) with appropriate labels and captions**
- Error analysis
  - Highlight specific patterns of error, outliers, or questionable features.
  - Note anything surprising or worth improving in future iterations.
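
As a starting point for the metrics and error analysis above, the following sketch computes RMSE and R² by hand and picks out the largest residuals. The target and prediction values are hypothetical, and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical true and predicted property values (in dollars)
y_true = np.array([200_000.0, 350_000.0, 500_000.0, 1_200_000.0])
y_pred = np.array([220_000.0, 330_000.0, 520_000.0, 900_000.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus the ratio of residual to total sum of squares
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Indices of the two largest absolute errors, worst first
worst = np.argsort(-np.abs(residuals))[:2]

print(rmse, r2, worst)
```

Because RMSE squares the residuals, a single expensive property that is badly mispredicted can dominate the metric, which is one reason to inspect the worst cases individually.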


#### 6. Conclusion [2 pts]
- Clearly state your main findings and how they address your original objectives
- Highlight the business or practical implications of your findings 
- Discuss the limitations and constraints of your analysis clearly and transparently
- Suggest potential improvements or future directions