# Set Up for Project Imports


In [None]:
# import sys
# from pathlib import Path

In [None]:
# source_directory = Path.cwd()
# ROOT = source_directory.parent
# if str(ROOT) not in sys.path:
#     sys.path.insert(0, str(ROOT))

In [None]:
# # Auto-reload code changes
# %load_ext autoreload
# %autoreload 2

# Imports

In [None]:
from data.api import UcIrvineAPI, UcIrvineDatasetIDs, BureauEconomicAnalysisAPI
import data.wrangling_utils
import pandas
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
pandas.set_option('display.max_colwidth', None)  # show all text in cells
# pandas.set_option("display.max_rows", 100_000)
pandas.options.mode.copy_on_write = True
pandas.set_option('display.float_format', lambda x: '%.2f' % x)

# UC Irvine Data

In [None]:
uci = UcIrvineAPI.fetch_dataset(repo_id=UcIrvineDatasetIDs.Apartment_For_Rent_Classified.value)
uci_df: pandas.DataFrame = uci.data.original.reset_index()

In [None]:
uci_df.describe()

In [None]:
clean_uci_df: pandas.DataFrame = data.wrangling_utils.clean(uci_df)
clean_uci_df.describe()

In [None]:
clean_uci_df.info()

In [None]:
cleaned_subset_df = clean_uci_df[['bathrooms', 'bedrooms', 'price', 'square_feet']].dropna()
cleaned_subset_df.describe()

Most of the dataset consists of small-to-moderate homes, but a handful of massive properties sit far above the rest.

In [None]:
sns.pairplot(cleaned_subset_df, kind='scatter', corner=True)

### Overview

This scatter-matrix visualizes pairwise relationships among bathrooms, bedrooms, price, and square_feet.
It captures both discrete-to-continuous and continuous-to-continuous relationships before any transformations.

#### Bathrooms ↔ Bedrooms

Strong positive association — homes with more bedrooms generally have more bathrooms.
The relationship appears step-like rather than smooth, since both are discrete integer features.
Sparse points for high values (≥6 bedrooms, ≥4 bathrooms) reflect rare, high-end homes.

#### Bathrooms ↔ Square Feet

Clear positive trend — as house size increases, bathroom count rises.
The data form vertical clusters due to rounding in square footage (e.g., multiples of 500 or 1000 ft²).
A few extreme points near 10,000–12,000 ft² expand the scale, showing luxury properties.

#### Bedrooms ↔ Square Feet

Also a strong positive relationship — most homes fall within 2–3 bedrooms and 700–1500 ft².
Vertical and horizontal banding shows how discrete bedroom counts intersect with rounded square footage.
Large, rare homes (8–9 bedrooms) form isolated points in the top-right corner.

#### Price ↔ Square Feet

Clear nonlinear positive pattern — price rises with square footage but with large variability.
Most observations lie below 2000 ft², where prices are tightly clustered.
The upper tail (prices above 20,000) shows a few high-end properties that stretch the scale.

#### Price ↔ Bedrooms / Bathrooms

Moderate, less linear relationships compared to square footage.
Considerable overlap: e.g., 2-bedroom and 3-bedroom homes have overlapping price ranges.
Suggests square footage (continuous) is a stronger predictor of price than room counts (discrete).

#### Axis-Bound Observations

- X-axes: discrete groupings (0–9) for bedrooms and bathrooms.
- Y-axes: continuous scales for price and square footage.
- Outliers: vertical streaks and upper-end clusters represent large, expensive homes that distort the visible scale.
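
To put a number on how many points stretch the scale, one option is the common 1.5 × IQR rule. The notebook itself does not apply this rule; the sketch below is only an illustration, and it runs on synthetic square-footage values rather than on `cleaned_subset_df`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic square footage (assumed values): typical homes plus a few very large ones.
typical = rng.normal(1_000, 250, size=995).clip(min=200)
huge = np.array([9_000, 10_000, 11_000, 11_500, 12_000])
square_feet = pd.Series(np.concatenate([typical, huge]))

# Upper fence of the 1.5 * IQR rule: Q3 + 1.5 * (Q3 - Q1).
q1, q3 = square_feet.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Everything above the fence is flagged as a high-end outlier.
outliers = square_feet[square_feet > upper_fence]
print(f"upper fence: {upper_fence:.0f} ft², flagged: {len(outliers)}")
```

Swapping the synthetic series for a real column such as `cleaned_subset_df["square_feet"]` would give the count for this dataset.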

In [None]:
cleaned_subset_df.skew()

### Skewness Analysis
| Feature       | Skew     | Interpretation             | Action                 |
| ------------- | -------- | -------------------------- | ---------------------- |
| `bathrooms`   | 0.95     | Slightly right-skewed      | Leave as-is (discrete) |
| `bedrooms`    | 0.88     | Slightly right-skewed      | Leave as-is (discrete) |
| `price`       | **9.81** | **Extremely right-skewed** | Must transform         |
| `square_feet` | **3.71** | **Strongly right-skewed**  | Should transform       |


All values are non-negative (zero or greater), so we use `np.log1p`, which computes log(1 + x) and is defined at zero, to reduce skewness in `price` and `square_feet`.
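
As a quick illustration of why a log-type transform helps, the sketch below applies `np.log1p` to a synthetic right-skewed series and compares skewness before and after. The data are generated here as a stand-in for a price column and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a price column (assumed parameters).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=7.0, sigma=0.6, size=5_000))

logged = np.log1p(prices)    # log(1 + x), defined at x = 0
restored = np.expm1(logged)  # inverse of log1p: exp(v) - 1

print(f"skew before: {prices.skew():.2f}")
print(f"skew after:  {logged.skew():.2f}")

# log1p maps 0 to 0, whereas np.log(0) would evaluate to -inf.
zero_result = float(np.log1p(0.0))
```

Because `np.expm1` inverts `np.log1p`, any value on the log scale can be converted back to the original units.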

In [None]:
cleaned_subset_transformed_df = pandas.DataFrame()
cleaned_subset_transformed_df["bathrooms"] = cleaned_subset_df["bathrooms"]
cleaned_subset_transformed_df["bedrooms"] = cleaned_subset_df["bedrooms"]
cleaned_subset_transformed_df["price_log"] = np.log1p(cleaned_subset_df["price"])
cleaned_subset_transformed_df["square_feet_log"] = np.log1p(cleaned_subset_df["square_feet"])
cleaned_subset_transformed_df.skew()

In [None]:
sns.pairplot(cleaned_subset_transformed_df, kind='scatter', corner=True)

### Overview

This scatter-matrix shows pairwise relationships among bathrooms, bedrooms, price_log, and square_feet_log.
After applying log transformations to continuous features, relationships that were previously nonlinear and skewed now appear smoother and more proportional.

#### Bathrooms ↔ Bedrooms

Still a strong positive ordinal relationship — homes with more bedrooms generally include more bathrooms.
The step-like structure remains since both variables are discrete integers.
Higher bedroom/bathroom combinations (≥6 bedrooms, ≥4 bathrooms) remain sparse and scattered — these are rare, high-value homes.

#### Bathrooms ↔ Square Feet (log)

Clear linear relationship: as home size increases, the number of bathrooms grows in roughly proportional steps.
The vertical clustering seen before has reduced; the log scale compresses large square-footage differences.
Outliers are far less extreme, showing improved scaling and distribution balance.

#### Bedrooms ↔ Square Feet (log)

Still a tight positive association with clearer proportionality than in the raw data.
Densest region is centered around 2–3 bedrooms and log(square_feet) ≈ 6.5–7.2, corresponding to ~700–1300 ft².
Larger properties (≥7 bedrooms) appear as upper-end discrete clusters — consistent with rare, luxury homes.
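
The conversion from the log axis back to square feet can be checked with `np.expm1`, the inverse of the `np.log1p` transform used above. The bounds 6.5 and 7.2 are the approximate values read off the plot.

```python
import numpy as np

# Approximate bounds of the dense region on the log-scaled axis.
log_bounds = np.array([6.5, 7.2])

# expm1 inverts log1p: exp(v) - 1 gives the value in original units (ft²).
square_feet_bounds = np.expm1(log_bounds)
print(square_feet_bounds.round())
```

This gives roughly 660 to 1,340 ft², in the same ballpark as the range quoted above.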

#### Price (log) ↔ Square Feet (log)

Now the strongest and most linear of the price relationships in the matrix.
Points form a clear diagonal cluster — as square_feet_log increases, price_log rises proportionally.
The log transform effectively reduces the influence of extreme prices, revealing a consistent scaling trend.

#### Price (log) ↔ Bedrooms / Bathrooms

Both show positive but weaker relationships than with square footage.
For small to medium homes (1–3 bedrooms), price distributions overlap heavily.
Marginal price gains per additional bedroom or bathroom appear smaller at higher counts — indicating diminishing returns in larger homes.

#### Axis-Bound Observations

- X-axes: discrete ranges (0–9) for bedrooms/bathrooms; continuous log scales for price and square footage.
- Y-axes: smoother and compressed compared to the raw plot; extreme high-end outliers no longer dominate.
- Diagonals: the clustering along diagonals reflects log-linear scaling, consistent with housing market data patterns.

#### Key Takeaways

- Log transformation successfully linearized and stabilized relationships among continuous variables.
- price_log and square_feet_log now display a clear proportional relationship ideal for regression or clustering.
- Discrete variables (bedrooms, bathrooms) retain ordinal structure and correlate logically with continuous ones.
- Data are now better balanced: reduced skew, consistent scales, and fewer distortions.
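
One way to summarize a proportional log-log relationship is the slope of a straight-line fit, which reads as an elasticity (percent change in price per percent change in size). The sketch below is a minimal example on synthetic data with an assumed true exponent of 0.6. It does not reproduce any result from this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (assumed model): price = 50 * sqft^0.6 with multiplicative noise.
square_feet = rng.uniform(400, 4_000, size=2_000)
price = 50.0 * square_feet**0.6 * rng.lognormal(0.0, 0.2, size=2_000)

# Same transform as in the notebook: log1p on both variables.
x = np.log1p(square_feet)
y = np.log1p(price)

# Degree-1 least-squares fit; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"estimated elasticity: {slope:.2f}")
```

On the real data, `x` and `y` would be the `square_feet_log` and `price_log` columns of `cleaned_subset_transformed_df`.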

In [None]:
correlation = cleaned_subset_transformed_df.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(8, 6))
plt.title("Correlation Heatmap", fontsize=14)
sns.heatmap(correlation, annot=True, linewidths=0.5, cmap='mako')
plt.show()

### Variable Types

- Discrete / ordinal: bedrooms, bathrooms — integer counts
- Continuous: price_log, square_feet_log (log-transformed with `log1p`)
- Correlation method: Pearson, the default for `DataFrame.corr`. For pairs involving the discrete counts, the coefficients are approximate measures of linear association, but they are still informative here because the counts are ordered and span a reasonable range (0–9).


| Pair                            | Correlation | Interpretation                                                                                                                                            |
| ------------------------------- | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **bathrooms ↔ bedrooms**        | **0.66**    | Strong positive relationship — homes with more bedrooms usually have more bathrooms.                                                                      |
| **bathrooms ↔ square_feet_log** | **0.70**    | Strong positive correlation — larger houses naturally have more bathrooms.                                                                                |
| **bedrooms ↔ square_feet_log**  | **0.70**    | Same strong pattern — larger homes have more bedrooms.                                                                                                    |
| **price_log ↔ square_feet_log** | **0.40**    | Moderate positive relationship — price generally rises with size, but not perfectly (other factors matter).                                               |
| **price_log ↔ bathrooms**       | **0.34**    | Mild correlation — price increases somewhat with bathroom count, but not linearly.                                                                        |
| **price_log ↔ bedrooms**        | **0.25**    | Weak correlation — price doesn’t increase as predictably with bedroom count, possibly because extra bedrooms add less marginal value than square footage. |


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)

# --- Price vs Bedrooms ---
sns.violinplot(
    data=cleaned_subset_transformed_df, x="bedrooms", y="price_log",
    inner=None, color=".8", ax=axes[0]
)
sns.boxplot(
    data=cleaned_subset_transformed_df, x="bedrooms", y="price_log",
    width=0.2, ax=axes[0]
)
axes[0].set_title("Price (log) by Bedrooms")
axes[0].set_xlabel("Bedrooms")
axes[0].set_ylabel("Price (log)")

# --- Square Feet vs Bedrooms ---
sns.violinplot(
    data=cleaned_subset_transformed_df, x="bedrooms", y="square_feet_log",
    inner=None, color=".8", ax=axes[1]
)
sns.boxplot(
    data=cleaned_subset_transformed_df, x="bedrooms", y="square_feet_log",
    width=0.2, ax=axes[1]
)
axes[1].set_title("Square Feet (log) by Bedrooms")
axes[1].set_xlabel("Bedrooms")
axes[1].set_ylabel("Square Feet (log)")

plt.tight_layout()
plt.show()


### Overview

These plots compare how both price (log) and square feet (log) vary across bedroom counts.
Each violin shows the full distribution shape, while the overlaid boxplots display median and interquartile range (IQR).
The two panels share both axes (`sharex=True, sharey=True`), so bedroom categories line up and the two log-scaled distributions can be compared directly.

#### Price (log) by Bedrooms — Left Plot

Clear positive association between bedroom count and log-price.
The median price gradually increases up to about 6 bedrooms.
Distribution width (spread) increases with bedrooms → larger homes show more price variability.
For higher bedroom counts (≥ 7), distributions narrow sharply — indicating rarer, high-value homes with consistent pricing.
Hints of bimodality at 2–3 bedrooms suggest different market tiers within typical homes.

#### Square Feet (log) by Bedrooms — Right Plot

Strong linear growth of home size with bedroom count.
Distributions are more compact than price, implying size scales more predictably than value.
Variability increases slightly through 5–6 bedrooms, then tightens again for the largest homes.
The near-parallel rise of median lines across bedroom categories reinforces the expected structural pattern:
more bedrooms → larger homes → higher price.

#### Comparative Observations

| Feature               | Trend with Bedrooms    | Spread / Variability | Notes                                                    |
| --------------------- | ---------------------- |----------------------|----------------------------------------------------------|
| **Price (log)**       | Increases non-linearly | Widens then narrows  | Market heterogeneity — location & luxury drive variance  |
| **Square Feet (log)** | Increases linearly     | Slightly widens      | Structural scaling — bedroom count reflects total area   |


#### Key Takeaways

- Bedroom count correlates strongly with both price and size, but price shows greater volatility due to external factors (location, condition, amenities).
- The log transformation stabilized variance and reduced extreme skew, making growth patterns clearer.
- Distributions beyond 6 bedrooms are based on few samples; treat them as illustrative rather than statistically robust.
- Violin + box overlays effectively communicate both distribution shape and summary statistics for ordinal categories.
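
The caution about small groups can be made concrete by tabulating the sample count, median, and IQR per bedroom category. The sketch below uses synthetic data, and the `MIN_SAMPLES` cutoff of 30 is a hypothetical threshold chosen for illustration, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic stand-in: most listings have 1-3 bedrooms, only five have 7.
bedrooms = np.repeat([1, 2, 3, 4, 7], [350, 400, 200, 45, 5])
price_log = 7.0 + 0.15 * bedrooms + rng.normal(0.0, 0.3, size=bedrooms.size)
demo = pd.DataFrame({"bedrooms": bedrooms, "price_log": price_log})


def iqr(values: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)


# One row per bedroom count with the statistics the box plots display.
summary = demo.groupby("bedrooms")["price_log"].agg(["count", "median", iqr])

MIN_SAMPLES = 30  # hypothetical cutoff for a "well-supported" group
summary["too_few"] = summary["count"] < MIN_SAMPLES
print(summary)
```

Groups flagged in `too_few` are the ones whose violins and boxes should be read as illustrative only.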

In [None]:
# DO NOT DELETE MIGHT NEED
# pandas.set_option("display.max_rows", 100_000)  # TOGGLE  UN/COMMENT
# pandas.reset_option("display.max_rows") # TOGGLE UN/COMMENT
# clean_uci_df['cityname'].value_counts(dropna=False)  # change column

In [None]:
# DO NOT DELETE MIGHT NEED
# import json
# s = clean_uci_df['bathrooms'].explode()
# global_counts = s.value_counts().to_dict()
# global_counts
# print(f"BAD_DATA: {json.dumps(BAD_DATA['cityname'], indent=2)}")  # CHANGE COL
# uci_df["state_full"] = uci_df["state"].str.upper().map(STATE_MAP)
# print(uci_df.shape)
# uci_df.dropna(subset=["state_full"], inplace=True)
# uci_df.shape

# Bureau of Economic Analysis Data

In [None]:
bea_df = BureauEconomicAnalysisAPI.fetch_dataset('Regional', GeoFips='STATE', TableName='SARPP', Year='2019',
                                                 LineCode='1')