### Initial Data Assessment

- Number of rows: …
- Number of columns: …
- Main continuous numeric variables: …
- Main discrete numeric variables: …
- Main categorical variables: …
- Columns with high missing percentage: …
- Potential issues observed:
  - …
  - …
- First ideas for cleaning and standardization:
  - …
  - …

In [1]:
print("Step 1 – Import core libraries")

import numpy as np
import pandas as pd

Step 1 – Import core libraries


In [2]:
print("Step 2 – Load raw AmesHousing dataset")

df = pd.read_csv("../data/raw/AmesHousing.csv")
df.head()

Step 2 – Load raw AmesHousing dataset


Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [3]:
print("Step 3 – Basic dataset overview: shape, columns, dtypes")

print("\nShape (rows, columns):")
print(df.shape)

print("\nFirst 5 columns:")
print(df.columns[:5])

print("\nData types (dtypes):")
print(df.dtypes.head(15))

Step 3 – Basic dataset overview: shape, columns, dtypes

Shape (rows, columns):
(2930, 82)

First 5 columns:
Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage'], dtype='object')

Data types (dtypes):
Order             int64
PID               int64
MS SubClass       int64
MS Zoning        object
Lot Frontage    float64
Lot Area          int64
Street           object
Alley            object
Lot Shape        object
Land Contour     object
Utilities        object
Lot Config       object
Land Slope       object
Neighborhood     object
Condition 1      object
dtype: object


In [4]:
print("Step 4 – Missing values overview (absolute and percentage for top 20 columns)")

missing_abs = df.isna().sum()
missing_pct = (missing_abs / len(df)) * 100

missing_summary = (
    pd.DataFrame({
        "missing_count": missing_abs,
        "missing_pct": missing_pct
    })
    .sort_values("missing_pct", ascending=False)
    .head(20)
)

missing_summary

Step 4 – Missing values overview (absolute and percentage for top 20 columns)


Unnamed: 0,missing_count,missing_pct
Pool QC,2917,99.556314
Misc Feature,2824,96.382253
Alley,2732,93.242321
Fence,2358,80.477816
Mas Vnr Type,1775,60.580205
Fireplace Qu,1422,48.532423
Lot Frontage,490,16.723549
Garage Cond,159,5.426621
Garage Finish,159,5.426621
Garage Yr Blt,159,5.426621
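
Several columns at the top of this table (`Pool QC`, `Misc Feature`, `Alley`, `Fence`, `Fireplace Qu`) are ones where the Ames data documentation uses `NA` as a category meaning the feature is absent, so a high missing percentage there does not necessarily signal bad data. Below is a minimal sketch, on a made-up three-row CSV rather than the real file, of how `pd.read_csv` can be asked to keep such strings instead of converting them to NaN. The `raw` text and its values are illustrative assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for the raw file (values are made up for illustration).
raw = "Pool QC,Alley,Lot Frontage\nNA,Grvl,80\nEx,NA,\nNA,NA,65\n"

# Default parsing: the string "NA" is read as a missing value.
default_df = pd.read_csv(io.StringIO(raw))

# Keep "NA" as a literal category; only empty fields count as missing.
kept_df = pd.read_csv(io.StringIO(raw), keep_default_na=False, na_values=[""])

print(default_df.isna().sum())
print(kept_df.isna().sum())
```

Whether to load the real file this way is a judgment call: genuinely missing numeric values such as `Lot Frontage` still need to be told apart from the "feature absent" categories.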


In [5]:
print("Step 5 – Split columns into numeric and categorical")

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(exclude=np.number).columns.tolist()

print(f"Total numeric columns: {len(numeric_cols)}")
print(f"Total categorical columns: {len(categorical_cols)}")

print("\nFirst 10 numeric columns:")
print(numeric_cols[:10])

print("\nFirst 10 categorical columns:")
print(categorical_cols[:10])

Step 5 – Split columns into numeric and categorical
Total numeric columns: 39
Total categorical columns: 43

First 10 numeric columns:
['Order', 'PID', 'MS SubClass', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area']

First 10 categorical columns:
['MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1']


In [6]:
print("Step 6 – Split numeric variables into continuous and discrete (based on nunique threshold)")

nunique_numeric = df[numeric_cols].nunique()

discrete_num_cols = nunique_numeric[nunique_numeric <= 20].index.tolist()
continuous_num_cols = nunique_numeric[nunique_numeric > 20].index.tolist()

print(f"Numeric continuous columns: {len(continuous_num_cols)}")
print(f"Numeric discrete columns:   {len(discrete_num_cols)}")

print("\nExamples of continuous numeric columns:")
print(continuous_num_cols[:10])

print("\nExamples of discrete numeric columns:")
print(discrete_num_cols[:10])

Step 6 – Split numeric variables into continuous and discrete (based on nunique threshold)
Numeric continuous columns: 24
Numeric discrete columns:   15

Examples of continuous numeric columns:
['Order', 'PID', 'Lot Frontage', 'Lot Area', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF']

Examples of discrete numeric columns:
['MS SubClass', 'Overall Qual', 'Overall Cond', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd']
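
The threshold rule puts `Order` and `PID` in the continuous list even though they look like row identifiers. One hedged way to flag such columns is to look for integer columns whose number of distinct values equals the number of rows. The sketch below uses a small made-up frame (`toy`) and a hypothetical helper, `find_id_like`. It is a heuristic, not a substitute for reading the data documentation.

```python
import pandas as pd


def find_id_like(frame: pd.DataFrame) -> list:
    """Return integer columns in which every row holds a distinct value."""
    return [
        col
        for col in frame.columns
        if pd.api.types.is_integer_dtype(frame[col])
        and frame[col].nunique() == len(frame)
    ]


# Made-up rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Order": [1, 2, 3, 4],
    "PID": [526301100, 526350040, 526351010, 526353030],
    "Overall Qual": [6, 5, 6, 7],
    "Lot Area": [31770.0, 11622.0, 14267.0, 11160.0],
})

print(find_id_like(toy))
```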


In [7]:
print("Step 7 – Summary statistics for continuous numeric variables")

cont_summary = df[continuous_num_cols].describe().T
cont_summary

Step 7 – Summary statistics for continuous numeric variables


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Order,2930.0,1465.5,845.9625,1.0,733.25,1465.5,2197.75,2930.0
PID,2930.0,714464500.0,188730800.0,526301100.0,528477000.0,535453620.0,907181100.0,1007100000.0
Lot Frontage,2440.0,69.22459,23.36533,21.0,58.0,68.0,80.0,313.0
Lot Area,2930.0,10147.92,7880.018,1300.0,7440.25,9436.5,11555.25,215245.0
Year Built,2930.0,1971.356,30.24536,1872.0,1954.0,1973.0,2001.0,2010.0
Year Remod/Add,2930.0,1984.267,20.86029,1950.0,1965.0,1993.0,2004.0,2010.0
Mas Vnr Area,2907.0,101.8968,179.1126,0.0,0.0,0.0,164.0,1600.0
BsmtFin SF 1,2929.0,442.6296,455.5908,0.0,0.0,370.0,734.0,5644.0
BsmtFin SF 2,2929.0,49.72243,169.1685,0.0,0.0,0.0,0.0,1526.0
Bsmt Unf SF,2929.0,559.2625,439.4942,0.0,219.0,466.0,802.0,2336.0
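
In the table above, `Lot Area` has a 75th percentile of about 11,555 but a maximum of 215,245, which suggests a long right tail. A minimal sketch of counting values outside the usual 1.5 × IQR fences follows. The helper name `iqr_outlier_mask` and the sample values are assumptions for illustration, and the 1.5 multiplier is a common convention rather than something this notebook prescribes.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Illustrative values loosely based on the quartiles reported above.
lot_area = pd.Series([7440, 8500, 9436, 10200, 11555, 215245], name="Lot Area")
mask = iqr_outlier_mask(lot_area)

print(lot_area[mask])
```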


In [8]:
print("Step 8 – Summary statistics for discrete numeric variables")

disc_summary = df[discrete_num_cols].describe().T
disc_summary

Step 8 – Summary statistics for discrete numeric variables


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MS SubClass,2930.0,57.387372,42.638025,20.0,20.0,50.0,70.0,190.0
Overall Qual,2930.0,6.094881,1.411026,1.0,5.0,6.0,7.0,10.0
Overall Cond,2930.0,5.56314,1.111537,1.0,5.0,5.0,6.0,9.0
Bsmt Full Bath,2928.0,0.431352,0.52482,0.0,0.0,0.0,1.0,3.0
Bsmt Half Bath,2928.0,0.061134,0.245254,0.0,0.0,0.0,0.0,2.0
Full Bath,2930.0,1.566553,0.552941,0.0,1.0,2.0,2.0,4.0
Half Bath,2930.0,0.379522,0.502629,0.0,0.0,0.0,1.0,2.0
Bedroom AbvGr,2930.0,2.854266,0.827731,0.0,2.0,3.0,3.0,8.0
Kitchen AbvGr,2930.0,1.044369,0.214076,0.0,1.0,1.0,1.0,3.0
TotRms AbvGrd,2930.0,6.443003,1.572964,2.0,5.0,6.0,7.0,15.0


In [9]:
print("Step 9 – Descriptive stats for categorical variables (count, unique, top, freq)")

cat_summary = df[categorical_cols].describe().T
cat_summary

Step 9 – Descriptive stats for categorical variables (count, unique, top, freq)


Unnamed: 0,count,unique,top,freq
MS Zoning,2930,7,RL,2273
Street,2930,2,Pave,2918
Alley,198,2,Grvl,120
Lot Shape,2930,4,Reg,1859
Land Contour,2930,4,Lvl,2633
Utilities,2930,3,AllPub,2927
Lot Config,2930,5,Inside,2140
Land Slope,2930,3,Gtl,2789
Neighborhood,2930,28,NAmes,443
Condition 1,2930,9,Norm,2522


In [10]:
print("Step 10 – Cardinality of categorical variables (sorted)")

cat_cardinality = df[categorical_cols].nunique().sort_values()
cat_cardinality

Step 10 – Cardinality of categorical variables (sorted)


Street             2
Alley              2
Central Air        2
Garage Finish      3
Utilities          3
Paved Drive        3
Land Slope         3
Fence              4
Pool QC            4
Exter Qual         4
Bsmt Exposure      4
Land Contour       4
Lot Shape          4
Mas Vnr Type       4
Bsmt Cond          5
Electrical         5
Kitchen Qual       5
Fireplace Qu       5
Bsmt Qual          5
Garage Cond        5
Garage Qual        5
Bldg Type          5
Misc Feature       5
Lot Config         5
Exter Cond         5
Heating QC         5
Garage Type        6
Sale Condition     6
BsmtFin Type 2     6
BsmtFin Type 1     6
Foundation         6
Roof Style         6
Heating            6
MS Zoning          7
Functional         8
Roof Matl          8
House Style        8
Condition 2        8
Condition 1        9
Sale Type         10
Exterior 1st      16
Exterior 2nd      17
Neighborhood      28
dtype: int64

In [11]:
print("Step 11 – Check for duplicate rows and low-variance columns")

# Duplicated rows
dup_count = df.duplicated().sum()
print(f"\nNumber of duplicated rows: {dup_count}")

# Columns with a single unique value (nearly useless)
nunique_all = df.nunique()
low_variance_cols = nunique_all[nunique_all == 1].index.tolist()
print(f"\nColumns with only one unique value: {len(low_variance_cols)}")
print(low_variance_cols)

Step 11 – Check for duplicate rows and low-variance columns

Number of duplicated rows: 0

Columns with only one unique value: 0
[]
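
The check above only catches columns with exactly one distinct value. A column such as `Utilities` (2,927 of 2,930 rows are `AllPub` in the Step 9 output) passes it while carrying very little variation. A hedged sketch of a dominant-value check follows. The helper `dominant_share`, the 0.99 cutoff, and the toy frame are illustrative assumptions.

```python
import pandas as pd


def dominant_share(frame: pd.DataFrame) -> pd.Series:
    """Share of non-missing rows taken by each column's most frequent value."""
    shares = {}
    for col in frame.columns:
        # value_counts sorts in descending order, so the first entry is the mode.
        counts = frame[col].value_counts(normalize=True)
        shares[col] = counts.iloc[0] if not counts.empty else float("nan")
    return pd.Series(shares)


# Made-up data: one nearly constant column and one balanced column.
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 199 + ["NoSewr"],
    "Lot Shape": ["Reg"] * 120 + ["IR1"] * 80,
})

shares = dominant_share(toy)
near_constant = shares[shares >= 0.99].index.tolist()
print(near_constant)
```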


In [12]:
print("Step 12 – Random sample of 10 rows for manual inspection")

df.sample(10, random_state=42)

Step 12 – Random sample of 10 rows for manual inspection


Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
1357,1358,903427090,70,RM,,5100,Pave,Grvl,Reg,Lvl,...,0,,MnPrv,,0,6,2008,WD,Normal,161000
2367,2368,527450460,160,RM,21.0,1890,Pave,,Reg,Lvl,...,0,,,,0,7,2006,WD,Normal,116000
2822,2823,908128100,60,RL,62.0,7162,Pave,,Reg,Lvl,...,0,,,,0,5,2006,WD,Normal,196500
2126,2127,907135180,20,RL,60.0,8070,Pave,,Reg,Lvl,...,0,,,,0,8,2007,WD,Normal,123600
1544,1545,910200080,30,RM,50.0,7000,Pave,,Reg,Lvl,...,0,,MnPrv,,0,7,2008,WD,Normal,126000
2415,2416,528221010,20,RL,102.0,11660,Pave,,IR1,Lvl,...,0,,,,0,7,2006,New,Partial,174190
2227,2228,909455060,120,RM,35.0,3907,Pave,,IR1,Bnk,...,0,,,,0,3,2007,WD,Normal,200000
410,411,527453060,160,RL,24.0,2280,Pave,,Reg,Lvl,...,0,,,,0,7,2009,WD,Normal,148500
761,762,904100190,20,RL,50.0,4280,Pave,,IR1,Lvl,...,0,,,,0,9,2009,WD,Normal,88750
436,437,528118060,60,RL,59.0,23303,Pave,,IR3,Lvl,...,0,,,,0,6,2009,WD,Family,409900


In [13]:
df.columns

Index(['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
      

In [14]:
print("=== Variable type inspection ===")

import numpy as np
import pandas as pd

# 1. Split numeric vs categorical
print("\n[Step 1] Splitting numeric and categorical columns...")

numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
categorical_cols = df.select_dtypes(exclude=np.number).columns.tolist()

print(f"Total numeric columns: {len(numeric_cols)}")
print(f"Total categorical columns: {len(categorical_cols)}")

# 2. Among numeric, split into 'continuous' and 'discrete' by nunique threshold
print("\n[Step 2] Splitting numeric columns into continuous vs discrete based on nunique <= 20...")

nunique_numeric = df[numeric_cols].nunique()

discrete_num_cols = nunique_numeric[nunique_numeric <= 20].index.tolist()
continuous_num_cols = nunique_numeric[nunique_numeric > 20].index.tolist()

print(f"Numeric continuous columns: {len(continuous_num_cols)}")
print(f"Numeric discrete columns:   {len(discrete_num_cols)}")

# 3. Show the full lists for review
print("\n[Output] Numeric continuous columns:")
print(continuous_num_cols)

print("\n[Output] Numeric discrete columns:")
print(discrete_num_cols)

print("\n[Output] Categorical columns:")
print(categorical_cols)

=== Variable type inspection ===

[Step 1] Splitting numeric and categorical columns...
Total numeric columns: 39
Total categorical columns: 43

[Step 2] Splitting numeric columns into continuous vs discrete based on nunique <= 20...
Numeric continuous columns: 24
Numeric discrete columns:   15

[Output] Numeric continuous columns:
['Order', 'PID', 'Lot Frontage', 'Lot Area', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Garage Yr Blt', 'Garage Area', 'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Misc Val', 'SalePrice']

[Output] Numeric discrete columns:
['MS SubClass', 'Overall Qual', 'Overall Cond', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Cars', 'Pool Area', 'Mo Sold', 'Yr Sold']

[Output] Categorical columns:
['MS Zoning', 'St

In [15]:
print(
    "Semantic Type Correction:\n"
    "After a manual, domain-informed review of the Ames Housing variable names "
    "and the official dataset documentation, we updated the automatic dtype/nunique-based "
    "classification to reflect the true semantic type of each column (continuous, discrete, "
    "ordinal, nominal, temporal, ID)."
)

# 1. Corrected groups ----------------------------------------------------------

id_cols = [
    "Order",
    "PID",
]

temporal_cols = [
    "Year Built",
    "Year Remod/Add",
    "Garage Yr Blt",
    "Yr Sold",
]

numeric_continuous_cols = [
    "Lot Frontage",
    "Lot Area",
    "Mas Vnr Area",
    "BsmtFin SF 1",
    "BsmtFin SF 2",
    "Bsmt Unf SF",
    "Total Bsmt SF",
    "1st Flr SF",
    "2nd Flr SF",
    "Low Qual Fin SF",
    "Gr Liv Area",
    "Garage Area",
    "Wood Deck SF",
    "Open Porch SF",
    "Enclosed Porch",
    "3Ssn Porch",
    "Screen Porch",
    "Pool Area",
    "Misc Val",
    "SalePrice",
]

numeric_discrete_cols = [
    "Bsmt Full Bath",
    "Bsmt Half Bath",
    "Full Bath",
    "Half Bath",
    "Bedroom AbvGr",
    "Kitchen AbvGr",
    "TotRms AbvGrd",
    "Fireplaces",
    "Garage Cars",
]

categorical_ordinal_cols = [
    "Overall Qual",
    "Overall Cond",
    "Exter Qual",
    "Exter Cond",
    "Bsmt Qual",
    "Bsmt Cond",
    "Bsmt Exposure",
    "BsmtFin Type 1",
    "BsmtFin Type 2",
    "Heating QC",
    "Kitchen Qual",
    "Fireplace Qu",
    "Garage Qual",
    "Garage Cond",
    "Pool QC",
    "Fence",
    "Functional",
]

categorical_nominal_cols = [
    "MS SubClass",
    "MS Zoning",
    "Street",
    "Alley",
    "Lot Shape",
    "Land Contour",
    "Utilities",
    "Lot Config",
    "Land Slope",
    "Neighborhood",
    "Condition 1",
    "Condition 2",
    "Bldg Type",
    "House Style",
    "Roof Style",
    "Roof Matl",
    "Exterior 1st",
    "Exterior 2nd",
    "Mas Vnr Type",
    "Foundation",
    "Heating",
    "Central Air",
    "Electrical",
    "Garage Type",
    "Garage Finish",
    "Paved Drive",
    "Misc Feature",
    "Sale Type",
    "Sale Condition",
    "Mo Sold",
]

# 2. Group aggregations --------------------------------------------------------

numeric_cols = numeric_continuous_cols + numeric_discrete_cols + temporal_cols
categorical_cols = categorical_nominal_cols + categorical_ordinal_cols

# 3. Consistency checks --------------------------------------------------------

all_grouped = set(
    id_cols
    + temporal_cols
    + numeric_continuous_cols
    + numeric_discrete_cols
    + categorical_ordinal_cols
    + categorical_nominal_cols
)

all_columns = set(df.columns)

missing_in_groups = all_columns - all_grouped
extra_in_groups = all_grouped - all_columns

print("\n=== Updated Variable Grouping (Semantic Correction Applied) ===\n")

print(f"Total columns in df:              {len(all_columns)}")
print(f"ID columns:                       {len(id_cols)}")
print(f"Temporal columns:                 {len(temporal_cols)}")
print(f"Numeric continuous columns:       {len(numeric_continuous_cols)}")
print(f"Numeric discrete columns:         {len(numeric_discrete_cols)}")
print(f"Ordinal categorical columns:      {len(categorical_ordinal_cols)}")
print(f"Nominal categorical columns:      {len(categorical_nominal_cols)}")
print(f"Total numeric (incl. temporal):   {len(numeric_cols)}")
print(f"Total categorical:                {len(categorical_cols)}")

print("\nColumns present in df but NOT in any group (should be empty):")
print(missing_in_groups)

print("\nColumns listed in groups but NOT found in df (should be empty):")
print(extra_in_groups)

Semantic Type Correction:
After a manual, domain-informed review of the Ames Housing variable names and the official dataset documentation, we updated the automatic dtype/nunique-based classification to reflect the true semantic type of each column (continuous, discrete, ordinal, nominal, temporal, ID).

=== Updated Variable Grouping (Semantic Correction Applied) ===

Total columns in df:              82
ID columns:                       2
Temporal columns:                 4
Numeric continuous columns:       20
Numeric discrete columns:         9
Ordinal categorical columns:      17
Nominal categorical columns:      30
Total numeric (incl. temporal):   33
Total categorical:                47

Columns present in df but NOT in any group (should be empty):
set()

Columns listed in groups but NOT found in df (should be empty):
set()
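
Because the consistency check builds a `set`, a column accidentally listed in two groups would go unnoticed: the union would still match `df.columns`. Here the group sizes add up to 82, so no overlap is present, but an explicit check is cheap. The sketch below uses `collections.Counter` on shortened, made-up group lists. The `groups` dict and its deliberate duplicate are assumptions for the example.

```python
from collections import Counter

# Shortened stand-ins for the real group lists.
groups = {
    "id": ["Order", "PID"],
    "temporal": ["Year Built", "Yr Sold"],
    "numeric_discrete": ["Full Bath", "Yr Sold"],  # deliberate duplicate
}

# Count how many groups each column appears in.
counts = Counter(col for cols in groups.values() for col in cols)
duplicated = sorted(col for col, n in counts.items() if n > 1)

print(duplicated)
```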


In [16]:
VARIABLE_DICTIONARY = {
    "Order": {
        "meaning": "Row Order",
        "description": "Row number indicating the observation order in the original dataset."
    },
    "PID": {
        "meaning": "Parcel Identification Number",
        "description": "Unique identifier assigned to each property parcel."
    },
    "MS SubClass": {
        "meaning": "Building Class",
        "description": "Identifies the type of dwelling (e.g., 1-story, 2-story, split-level)."
    },
    "MS Zoning": {
        "meaning": "Zoning Classification",
        "description": "General zoning classification of the property (e.g., residential, commercial)."
    },
    "Lot Frontage": {
        "meaning": "Lot Frontage",
        "description": "Linear feet of street connected to the property."
    },
    "Lot Area": {
        "meaning": "Lot Area",
        "description": "Total square footage of the lot."
    },
    "Street": {
        "meaning": "Street Type",
        "description": "Type of street access (paved, gravel, etc.)."
    },
    "Alley": {
        "meaning": "Alley Access",
        "description": "Type of alley access to the property."
    },
    "Lot Shape": {
        "meaning": "Lot Shape",
        "description": "Overall shape of the property (regular, irregular, etc.)."
    },
    "Land Contour": {
        "meaning": "Land Contour",
        "description": "Flatness and general contour of the property."
    },
    "Utilities": {
        "meaning": "Utilities",
        "description": "Type of utilities available (gas, electricity, sewage)."
    },
    "Lot Config": {
        "meaning": "Lot Configuration",
        "description": "Property configuration (inside lot, corner lot, cul-de-sac, etc.)."
    },
    "Land Slope": {
        "meaning": "Land Slope",
        "description": "Slope of the property (gentle, moderate, severe)."
    },
    "Neighborhood": {
        "meaning": "Neighborhood",
        "description": "Physical location within the city of Ames."
    },
    "Condition 1": {
        "meaning": "Primary Proximity Condition",
        "description": "Proximity to main road, railroad, or other environmental factors."
    },
    "Condition 2": {
        "meaning": "Secondary Proximity Condition",
        "description": "Secondary proximity to environmental features."
    },
    "Bldg Type": {
        "meaning": "Building Type",
        "description": "Type of dwelling (single-family, duplex, townhouse, etc.)."
    },
    "House Style": {
        "meaning": "House Style",
        "description": "Architectural style of the dwelling (1-story, 2-story, split-level, etc.)."
    },
    "Overall Qual": {
        "meaning": "Overall Quality",
        "description": "Overall material and finish quality, rated from 1 (lowest) to 10 (highest)."
    },
    "Overall Cond": {
        "meaning": "Overall Condition",
        "description": "Overall physical condition of the house, rated from 1 (poor) to 10 (excellent)."
    },
    "Year Built": {
        "meaning": "Construction Year",
        "description": "Original construction year of the property."
    },
    "Year Remod/Add": {
        "meaning": "Remodel Year",
        "description": "Year of last remodeling or addition; same as Year Built if no remodeling occurred."
    },
    "Roof Style": {
        "meaning": "Roof Style",
        "description": "Type of roof design (gable, hip, flat, etc.)."
    },
    "Roof Matl": {
        "meaning": "Roof Material",
        "description": "Primary roof covering material."
    },
    "Exterior 1st": {
        "meaning": "Exterior Covering (Primary)",
        "description": "Primary exterior material covering the house."
    },
    "Exterior 2nd": {
        "meaning": "Exterior Covering (Secondary)",
        "description": "Secondary exterior material (if multiple materials are used)."
    },
    "Mas Vnr Type": {
        "meaning": "Masonry Veneer Type",
        "description": "Type of masonry veneer applied to the exterior."
    },
    "Mas Vnr Area": {
        "meaning": "Masonry Veneer Area",
        "description": "Area of masonry veneer in square feet."
    },
    "Exter Qual": {
        "meaning": "Exterior Quality",
        "description": "Quality of the material on the exterior (ordinal: Po < Fa < TA < Gd < Ex)."
    },
    "Exter Cond": {
        "meaning": "Exterior Condition",
        "description": "Condition of the exterior material (ordinal scale)."
    },
    "Foundation": {
        "meaning": "Foundation Type",
        "description": "Type of foundation (crawl space, slab, basement, etc.)."
    },
    "Bsmt Qual": {
        "meaning": "Basement Quality",
        "description": "Height and quality of the basement (ordinal)."
    },
    "Bsmt Cond": {
        "meaning": "Basement Condition",
        "description": "Overall condition of the basement (ordinal)."
    },
    "Bsmt Exposure": {
        "meaning": "Basement Exposure",
        "description": "Walkout or garden-level basement exposure (ordinal)."
    },
    "BsmtFin Type 1": {
        "meaning": "Basement Finish Type 1",
        "description": "Primary basement finished area rating (ordinal)."
    },
    "BsmtFin SF 1": {
        "meaning": "Basement Finished Sq Ft 1",
        "description": "Finished square footage of primary basement area."
    },
    "BsmtFin Type 2": {
        "meaning": "Basement Finish Type 2",
        "description": "Secondary finished area rating (ordinal)."
    },
    "BsmtFin SF 2": {
        "meaning": "Basement Finished Sq Ft 2",
        "description": "Finished square footage of secondary basement area."
    },
    "Bsmt Unf SF": {
        "meaning": "Basement Unfinished Area",
        "description": "Unfinished square footage of the basement."
    },
    "Total Bsmt SF": {
        "meaning": "Total Basement Area",
        "description": "Total square footage of the basement (finished + unfinished)."
    },
    "Heating": {
        "meaning": "Heating System",
        "description": "Type of heating installed in the property."
    },
    "Heating QC": {
        "meaning": "Heating Quality and Condition",
        "description": "Quality and condition of the heating system (ordinal)."
    },
    "Central Air": {
        "meaning": "Central Air Conditioning",
        "description": "Whether the house has central AC (Y/N)."
    },
    "Electrical": {
        "meaning": "Electrical System",
        "description": "Electrical wiring type."
    },
    "1st Flr SF": {
        "meaning": "First Floor Area",
        "description": "Square footage of the first floor."
    },
    "2nd Flr SF": {
        "meaning": "Second Floor Area",
        "description": "Square footage of the second floor."
    },
    "Low Qual Fin SF": {
        "meaning": "Low Quality Finished Area",
        "description": "Finished area of low quality (all floors)."
    },
    "Gr Liv Area": {
        "meaning": "Above-Grade Living Area",
        "description": "Above-ground living area in square feet."
    },
    "Bsmt Full Bath": {
        "meaning": "Basement Full Bathrooms",
        "description": "Number of full bathrooms located in the basement."
    },
    "Bsmt Half Bath": {
        "meaning": "Basement Half Bathrooms",
        "description": "Number of half bathrooms located in the basement."
    },
    "Full Bath": {
        "meaning": "Full Bathrooms (Above Grade)",
        "description": "Number of full bathrooms above grade."
    },
    "Half Bath": {
        "meaning": "Half Bathrooms (Above Grade)",
        "description": "Number of half bathrooms above grade."
    },
    "Bedroom AbvGr": {
        "meaning": "Bedrooms Above Grade",
        "description": "Number of bedrooms located above ground level."
    },
    "Kitchen AbvGr": {
        "meaning": "Kitchens Above Grade",
        "description": "Number of kitchens located above ground level."
    },
    "Kitchen Qual": {
        "meaning": "Kitchen Quality",
        "description": "Quality of the kitchen (ordinal)."
    },
    "TotRms AbvGrd": {
        "meaning": "Total Rooms Above Grade",
        "description": "Total number of rooms above grade, excluding bathrooms."
    },
    "Functional": {
        "meaning": "Home Functionality",
        "description": "Overall functionality of the home (ordinal)."
    },
    "Fireplaces": {
        "meaning": "Fireplaces",
        "description": "Number of fireplaces in the property."
    },
    "Fireplace Qu": {
        "meaning": "Fireplace Quality",
        "description": "Quality of the fireplace (ordinal)."
    },
    "Garage Type": {
        "meaning": "Garage Location",
        "description": "Location of the garage relative to the home."
    },
    "Garage Yr Blt": {
        "meaning": "Garage Construction Year",
        "description": "Original year the garage was constructed."
    },
    "Garage Finish": {
        "meaning": "Garage Interior Finish",
        "description": "Interior finish of the garage (finished, unfinished, rough)."
    },
    "Garage Cars": {
        "meaning": "Garage Capacity",
        "description": "Number of cars the garage can accommodate."
    },
    "Garage Area": {
        "meaning": "Garage Area",
        "description": "Square footage of the garage."
    },
    "Garage Qual": {
        "meaning": "Garage Quality",
        "description": "Quality of the garage (ordinal)."
    },
    "Garage Cond": {
        "meaning": "Garage Condition",
        "description": "Condition of the garage (ordinal)."
    },
    "Paved Drive": {
        "meaning": "Paved Driveway",
        "description": "Whether the driveway is paved (Y/N/P)."
    },
    "Wood Deck SF": {
        "meaning": "Wood Deck Area",
        "description": "Square footage of wood deck area."
    },
    "Open Porch SF": {
        "meaning": "Open Porch Area",
        "description": "Square footage of the open porch."
    },
    "Enclosed Porch": {
        "meaning": "Enclosed Porch Area",
        "description": "Square footage of enclosed porch area."
    },
    "3Ssn Porch": {
        "meaning": "Three-Season Porch Area",
        "description": "Square footage of three-season porch area."
    },
    "Screen Porch": {
        "meaning": "Screen Porch Area",
        "description": "Square footage of screened porch area."
    },
    "Pool Area": {
        "meaning": "Pool Area",
        "description": "Total pool area in square feet."
    },
    "Pool QC": {
        "meaning": "Pool Quality",
        "description": "Quality of the pool (ordinal)."
    },
    "Fence": {
        "meaning": "Fence Quality",
        "description": "Quality of the fence (ordinal)."
    },
    "Misc Feature": {
        "meaning": "Miscellaneous Feature",
        "description": "Miscellaneous feature not covered in other categories."
    },
    "Misc Val": {
        "meaning": "Miscellaneous Value",
        "description": "Value of the miscellaneous feature in USD."
    },
    "Mo Sold": {
        "meaning": "Month Sold",
        "description": "Month the property was sold."
    },
    "Yr Sold": {
        "meaning": "Year Sold",
        "description": "Year the property was sold."
    },
    "Sale Type": {
        "meaning": "Sale Type",
        "description": "Type of sale (e.g., warranty deed, contract, new construction)."
    },
    "Sale Condition": {
        "meaning": "Sale Condition",
        "description": "Condition of the sale (normal, partial, abnormal)."
    },
    "SalePrice": {
        "meaning": "Sale Price",
        "description": "Final sale price of the property in USD (this is the target variable)."
    }
}
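
The dictionary records `Exter Qual` as ordinal with the order Po < Fa < TA < Gd < Ex. One way to make that order usable in pandas is an ordered categorical dtype. The following is a sketch on a few made-up values, not part of the original notebook. `quality_dtype` is a name introduced here.

```python
import pandas as pd

# Ordered scale taken from the Exter Qual description above.
quality_dtype = pd.CategoricalDtype(["Po", "Fa", "TA", "Gd", "Ex"], ordered=True)

# Made-up sample values.
exter_qual = pd.Series(["TA", "Gd", "Ex", "Fa", "TA"]).astype(quality_dtype)

# Integer codes follow the declared order (Po=0 ... Ex=4).
codes = exter_qual.cat.codes.tolist()
# Ordered categoricals support comparisons against a category label.
n_good_or_better = int((exter_qual >= "Gd").sum())

print(codes)
print(exter_qual.min(), exter_qual.max(), n_good_or_better)
```

Values outside the declared categories become NaN on conversion, so it is worth checking the missing count before and after the cast.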

In [17]:
print("Building complete, enriched variable dictionary...\n")

# ---------------------------------------------------------
# Helper: determine semantic group for any column
# ---------------------------------------------------------

def get_semantic_group(col: str) -> str:
    if col in id_cols:
        return "id"
    if col in temporal_cols:
        return "temporal"
    if col in numeric_continuous_cols:
        return "numeric_continuous"
    if col in numeric_discrete_cols:
        return "numeric_discrete"
    if col in categorical_ordinal_cols:
        return "categorical_ordinal"
    if col in categorical_nominal_cols:
        return "categorical_nominal"
    return "unknown"


# ---------------------------------------------------------
# Build full enriched dictionary
# ---------------------------------------------------------

FULL_VARIABLE_DICT = {}

for col in df.columns:
    series = df[col]
    
    # Fall back to None when a column has no hand-written entry
    entry = VARIABLE_DICTIONARY.get(col, {})
    meaning = entry.get("meaning")
    description = entry.get("description")
    
    semantic_group = get_semantic_group(col)
    dtype_str = str(series.dtype)
    
    n_missing = int(series.isna().sum())
    missing_pct = float(series.isna().mean() * 100)
    
    n_unique = int(series.nunique(dropna=True))
    example_values = series.dropna().unique()[:5].tolist()
    
    FULL_VARIABLE_DICT[col] = {
        "meaning": meaning,
        "description": description,
        "semantic_group": semantic_group,
        "pandas_dtype": dtype_str,
        "n_missing": n_missing,
        "missing_pct": round(missing_pct, 3),
        "n_unique": n_unique,
        "example_values": example_values
    }

print("Full dictionary successfully created!")
print(f"Total variables: {len(FULL_VARIABLE_DICT)}")

# Preview first 5 entries
list(FULL_VARIABLE_DICT.items())[:5]

Building complete, enriched variable dictionary...

Full dictionary successfully created!
Total variables: 82


[('Order',
  {'meaning': 'Row Order',
   'description': 'Row number indicating the observation order in the original dataset.',
   'semantic_group': 'id',
   'pandas_dtype': 'int64',
   'n_missing': 0,
   'missing_pct': 0.0,
   'n_unique': 2930,
   'example_values': [1, 2, 3, 4, 5]}),
 ('PID',
  {'meaning': 'Parcel Identification Number',
   'description': 'Unique identifier assigned to each property parcel.',
   'semantic_group': 'id',
   'pandas_dtype': 'int64',
   'n_missing': 0,
   'missing_pct': 0.0,
   'n_unique': 2930,
   'example_values': [526301100, 526350040, 526351010, 526353030, 527105010]}),
 ('MS SubClass',
  {'meaning': 'Building Class',
   'description': 'Identifies the type of dwelling (e.g., 1-story, 2-story, split-level).',
   'semantic_group': 'categorical_nominal',
   'pandas_dtype': 'int64',
   'n_missing': 0,
   'missing_pct': 0.0,
   'n_unique': 16,
   'example_values': [20, 60, 120, 50, 85]}),
 ('MS Zoning',
  {'meaning': 'Zoning Classification',
   'descriptio

In [18]:
import os
import json  # needed for json.dump below

os.makedirs("data/metadata", exist_ok=True)

with open("data/metadata/variable_dictionary_full.json", "w") as f:
    json.dump(FULL_VARIABLE_DICT, f, indent=4)

print("Saved full dictionary to data/metadata/variable_dictionary_full.json")

Saved full dictionary to data/metadata/variable_dictionary_full.json
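
The dump above succeeds because every profile value was converted to a native Python type first (`int(...)`, `float(...)`, `.tolist()`). If a NumPy scalar or array slipped into the dictionary, `json.dump` would raise a `TypeError`. A minimal sketch of a fallback handler follows; `to_native` is a hypothetical helper, not part of the notebook, and the sample values are illustrative only.

```python
import json

import numpy as np


def to_native(obj):
    """Fallback for json.dumps: convert NumPy scalars and arrays to plain Python."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Not JSON serializable: {type(obj).__name__}")


# Illustrative entry that still carries NumPy types (not real profile values)
sample_entry = {
    "n_missing": np.int64(3),
    "missing_pct": np.float64(1.5),
    "example_values": np.array([141.0, 80.0, 81.0]),
}

encoded = json.dumps(sample_entry, default=to_native)
decoded = json.loads(encoded)
```

Passing `default=to_native` to `json.dump` would work the same way when writing to a file.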


In [19]:
import os
from pathlib import Path

print("Generating Markdown variable dictionary from FULL_VARIABLE_DICT...")

# Make sure the metadata folder exists
os.makedirs("data/metadata", exist_ok=True)

output_path = Path("data/metadata/variable_dictionary.md")

lines = []
lines.append("# Variable Dictionary\n")
lines.append(
    "This document summarizes all variables in the Ames Housing dataset, "
    "including their semantic meaning, type, and basic data profile.\n"
)

# If you want to preserve the original column order, use FULL_VARIABLE_DICT keys directly
for var_name, meta in FULL_VARIABLE_DICT.items():
    meaning = meta.get("meaning") or "N/A"
    description = meta.get("description") or "N/A"
    semantic_group = meta.get("semantic_group") or "N/A"
    pandas_dtype = meta.get("pandas_dtype") or "N/A"
    n_missing = meta.get("n_missing", "N/A")
    missing_pct = meta.get("missing_pct", "N/A")
    n_unique = meta.get("n_unique", "N/A")
    example_values = meta.get("example_values") or []
    example_values_str = ", ".join(map(str, example_values)) if example_values else "N/A"

    lines.append(f"## `{var_name}`\n")
    lines.append(f"- **Meaning:** {meaning}")
    lines.append(f"- **Description:** {description}")
    lines.append(f"- **Semantic group:** `{semantic_group}`")
    lines.append(f"- **Pandas dtype:** `{pandas_dtype}`")
    lines.append(f"- **Missing values:** {n_missing} ({missing_pct}%)")
    lines.append(f"- **Unique values:** {n_unique}")
    lines.append(f"- **Example values:** {example_values_str}\n")

# Write to file
output_path.write_text("\n".join(lines), encoding="utf-8")

print(f"Markdown dictionary saved to: {output_path}")

Generating Markdown variable dictionary from FULL_VARIABLE_DICT...
Markdown dictionary saved to: data/metadata/variable_dictionary.md
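
The per-variable sections are easy to read one at a time but hard to scan across 82 variables. A minimal sketch of a compact overview table that could precede them, assuming entries shaped like those in `FULL_VARIABLE_DICT`; `summary_table` is a hypothetical helper and the two sample entries are copied from the preview output above.

```python
def summary_table(variable_dict: dict) -> str:
    """Render one Markdown table row per variable from its profile entry."""
    header = [
        "| Variable | Semantic group | Dtype | Missing (%) | Unique |",
        "| --- | --- | --- | --- | --- |",
    ]
    rows = [
        f"| `{name}` | `{meta['semantic_group']}` | `{meta['pandas_dtype']}` "
        f"| {meta['missing_pct']} | {meta['n_unique']} |"
        for name, meta in variable_dict.items()
    ]
    return "\n".join(header + rows)


sample = {
    "Order": {
        "semantic_group": "id",
        "pandas_dtype": "int64",
        "missing_pct": 0.0,
        "n_unique": 2930,
    },
    "MS SubClass": {
        "semantic_group": "categorical_nominal",
        "pandas_dtype": "int64",
        "missing_pct": 0.0,
        "n_unique": 16,
    },
}

table = summary_table(sample)
print(table)
```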


In [20]:
# -------------------------------------------------------------------
# Start from raw dataframe
# -------------------------------------------------------------------
df_clean = df.copy()

# -------------------------------------------------------------------
# 1. Drop ID-like column "Order", keep PID
# -------------------------------------------------------------------
if "Order" in df_clean.columns:
    df_clean = df_clean.drop(columns=["Order"])

# -------------------------------------------------------------------
# 2. Semantic groups (must match our data_cleaning_policy.md)
# -------------------------------------------------------------------
temporal_cols = ["Year Built", "Year Remod/Add", "Garage Yr Blt", "Yr Sold"]

numeric_continuous_cols = [
    "Lot Frontage",
    "Lot Area",
    "Mas Vnr Area",
    "BsmtFin SF 1",
    "BsmtFin SF 2",
    "Bsmt Unf SF",
    "Total Bsmt SF",
    "1st Flr SF",
    "2nd Flr SF",
    "Low Qual Fin SF",
    "Gr Liv Area",
    "Garage Area",
    "Wood Deck SF",
    "Open Porch SF",
    "Enclosed Porch",
    "3Ssn Porch",
    "Screen Porch",
    "Pool Area",
    "Misc Val",
    "SalePrice",
]

numeric_discrete_cols = [
    "Bsmt Full Bath",
    "Bsmt Half Bath",
    "Full Bath",
    "Half Bath",
    "Bedroom AbvGr",
    "Kitchen AbvGr",
    "TotRms AbvGrd",
    "Fireplaces",
    "Garage Cars",
]

categorical_ordinal_cols = [
    "Overall Qual",
    "Overall Cond",
    "Exter Qual",
    "Exter Cond",
    "Bsmt Qual",
    "Bsmt Cond",
    "Bsmt Exposure",
    "BsmtFin Type 1",
    "BsmtFin Type 2",
    "Heating QC",
    "Kitchen Qual",
    "Fireplace Qu",
    "Garage Qual",
    "Garage Cond",
    "Pool QC",
    "Fence",
    "Functional",
]

categorical_nominal_cols = [
    "MS SubClass",
    "MS Zoning",
    "Street",
    "Alley",
    "Lot Shape",
    "Land Contour",
    "Utilities",
    "Lot Config",
    "Land Slope",
    "Neighborhood",
    "Condition 1",
    "Condition 2",
    "Bldg Type",
    "House Style",
    "Roof Style",
    "Roof Matl",
    "Exterior 1st",
    "Exterior 2nd",
    "Mas Vnr Type",
    "Foundation",
    "Heating",
    "Central Air",
    "Electrical",
    "Garage Type",
    "Garage Finish",
    "Paved Drive",
    "Misc Feature",
    "Sale Type",
    "Sale Condition",
    "Mo Sold",
]

# -------------------------------------------------------------------
# 3. Temporal columns → datetime (YYYY-01-01)
#    Garage Yr Blt: fill missing with Year Built before conversion
# -------------------------------------------------------------------
def year_to_date(series: pd.Series) -> pd.Series:
    """Convert a year-like numeric series to a datetime (YYYY-01-01)."""
    # Use pandas nullable integer first to allow NaN safely if needed
    years = series.astype("Int64")
    return pd.to_datetime(years.astype("string") + "-01-01")

# Fill missing garage year with house construction year
df_clean["Garage Yr Blt"] = df_clean["Garage Yr Blt"].fillna(df_clean["Year Built"])

for col in temporal_cols:
    df_clean[col] = year_to_date(df_clean[col])

# -------------------------------------------------------------------
# 4. Numeric continuous: missing values
#    - Lot Frontage: median per Neighborhood
#    - Others: global median
# -------------------------------------------------------------------
# Lot Frontage by neighborhood
if "Lot Frontage" in df_clean.columns:
    lot_frontage_global_median = df_clean["Lot Frontage"].median()
    lot_frontage_by_neigh = (
        df_clean.groupby("Neighborhood")["Lot Frontage"].median()
    )

    # Vectorized imputation: neighborhood median first, global median as fallback
    df_clean["Lot Frontage"] = (
        df_clean["Lot Frontage"]
        .fillna(df_clean["Neighborhood"].map(lot_frontage_by_neigh))
        .fillna(lot_frontage_global_median)
    )

# Other continuous variables: simple median imputation
for col in numeric_continuous_cols:
    if col == "Lot Frontage":
        continue  # already handled
    if col in df_clean.columns:
        median_value = df_clean[col].median()
        df_clean[col] = df_clean[col].fillna(median_value)

# -------------------------------------------------------------------
# 5. Numeric discrete: mode imputation + int64
# -------------------------------------------------------------------
for col in numeric_discrete_cols:
    if col not in df_clean.columns:
        continue
    mode_value = df_clean[col].mode(dropna=True)
    if not mode_value.empty:
        mode_value = mode_value.iloc[0]
        df_clean[col] = df_clean[col].fillna(mode_value)
    df_clean[col] = df_clean[col].astype("int64")

# -------------------------------------------------------------------
# 6. Categorical nominal:
#    - standardize strings
#    - "no feature" → 'None'
#    - others → mode
# -------------------------------------------------------------------
no_feature_nominal = [
    "Alley",
    "Misc Feature",
    "Mas Vnr Type",
]

for col in categorical_nominal_cols:
    if col not in df_clean.columns:
        continue

    # Normalize to string and strip whitespace
    df_clean[col] = df_clean[col].astype("string").str.strip()

    if col in no_feature_nominal:
        df_clean[col] = df_clean[col].fillna("None")
    else:
        if df_clean[col].isna().any():
            mode_value = df_clean[col].mode(dropna=True)
            if not mode_value.empty:
                mode_value = mode_value.iloc[0]
                df_clean[col] = df_clean[col].fillna(mode_value)

# -------------------------------------------------------------------
# 7. Categorical ordinal: mapping to integers, with safe default
# -------------------------------------------------------------------

# Shared quality scale (includes NA-like categories as 0)
quality_map = {
    "NA": 0,
    "None": 0,
    "<NA>": 0,
    "Po": 1,
    "Fa": 2,
    "TA": 3,
    "Gd": 4,
    "Ex": 5,
}

# Basement exposure
bsmt_exposure_map = {
    "NA": 0,
    "None": 0,
    "<NA>": 0,
    "No": 1,
    "Mn": 2,
    "Av": 3,
    "Gd": 4,
}

# Basement finish types
bsmt_fin_type_map = {
    "NA": 0,
    "None": 0,
    "<NA>": 0,
    "Unf": 1,
    "LwQ": 2,
    "Rec": 3,
    "BLQ": 4,
    "ALQ": 5,
    "GLQ": 6,
}

# Functional
functional_map = {
    "Sal": 1,
    "Sev": 2,
    "Maj2": 3,
    "Maj1": 4,
    "Mod": 5,
    "Min2": 6,
    "Min1": 7,
    "Typ": 8,
}

def map_with_default(series: pd.Series, mapping: dict, default: int = 0) -> pd.Series:
    """
    Safely map string categories to integers:
    - convert to string
    - apply mapping
    - any missing/unmapped → default
    - return int64
    """
    series = series.astype("string")
    mapped = series.map(mapping)
    mapped = mapped.fillna(default)
    return mapped.astype("int64")

# Overall quality/condition already numeric ordinal: ensure int
df_clean["Overall Qual"] = df_clean["Overall Qual"].astype("int64")
df_clean["Overall Cond"] = df_clean["Overall Cond"].astype("int64")

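# Quality-like variables using quality_map
for col in [
    "Exter Qual", "Exter Cond", "Bsmt Qual", "Bsmt Cond",
    "Heating QC", "Kitchen Qual", "Fireplace Qu",
    "Garage Qual", "Garage Cond", "Pool QC"
]:
    if col in df_clean.columns:
        df_clean[col] = map_with_default(df_clean[col], quality_map, default=0)

# Fence uses its own labels (e.g. MnPrv), which quality_map would all send to 0
fence_map = {"MnWw": 1, "GdWo": 2, "MnPrv": 3, "GdPrv": 4}
if "Fence" in df_clean.columns:
    df_clean["Fence"] = map_with_default(df_clean["Fence"], fence_map, default=0)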

# Basement exposure
if "Bsmt Exposure" in df_clean.columns:
    df_clean["Bsmt Exposure"] = map_with_default(
        df_clean["Bsmt Exposure"], bsmt_exposure_map, default=0
    )

# Basement finish types
for col in ["BsmtFin Type 1", "BsmtFin Type 2"]:
    if col in df_clean.columns:
        df_clean[col] = map_with_default(df_clean[col], bsmt_fin_type_map, default=0)

# Functional (missing treated as "Typ" by default, then mapped)
if "Functional" in df_clean.columns:
    df_clean["Functional"] = (
        df_clean["Functional"]
        .astype("string")
        .fillna("Typ")
        .map(functional_map)
        .fillna(functional_map["Typ"])
        .astype("int64")
    )
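
Because `map_with_default` sends every unmapped label to the default, a scale that does not fit a column fails silently: the column simply becomes all zeros. A minimal sketch of a pre-check that lists the labels a mapping does not cover; `unmapped_categories` is a hypothetical helper and the two sample series are toy data, not the real columns.

```python
import pandas as pd


def unmapped_categories(series: pd.Series, mapping: dict) -> set:
    """Return the non-missing labels in `series` that have no key in `mapping`."""
    observed = set(series.dropna().astype(str).unique())
    return observed - set(mapping)


quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Toy samples: one column that fits the quality scale, one that does not
kitchen_sample = pd.Series(["TA", "Gd", "Ex", None])
fence_sample = pd.Series(["MnPrv", None, "GdPrv", "GdWo", "MnWw"])

kitchen_gaps = unmapped_categories(kitchen_sample, quality_map)
fence_gaps = unmapped_categories(fence_sample, quality_map)
print("Kitchen Qual gaps:", kitchen_gaps)
print("Fence gaps:", fence_gaps)
```

An empty set means the scale covers the column. A non-empty set is a signal to give the column its own mapping before calling `map_with_default`.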

In [21]:
# -------------------------------------------------------------------
# 8. Save cleaned dataset
# -------------------------------------------------------------------
os.makedirs("data/cleaned", exist_ok=True)

csv_path = "data/cleaned/ames_cleaned.csv"
df_clean.to_csv(csv_path, index=False)

print(f"Cleaned dataset saved to: {csv_path}")
print("df_clean shape:", df_clean.shape)
print("Any remaining NaN?", df_clean.isna().sum().sum())

Cleaned dataset saved to: data/cleaned/ames_cleaned.csv
df_clean shape: (2930, 81)
Any remaining NaN? 0
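
A CSV file stores no dtypes, so the zero-NaN result above describes `df_clean` in memory and not necessarily the file once it is read back. Datetime columns come back as text unless they are parsed again, and, depending on the pandas version, the literal string `"None"` may be among the default NA markers of `read_csv` and be read back as missing. A minimal sketch of a reload that preserves both, using a toy frame and an in-memory buffer in place of the real file; the `Year Built` values are illustrative only.

```python
import io

import pandas as pd

# Toy frame standing in for df_clean
toy = pd.DataFrame({
    "PID": [526301100, 526350040],
    "Alley": ["None", "Pave"],
    "Year Built": pd.to_datetime(["1960-01-01", "1961-01-01"]),
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(
    buffer,
    parse_dates=["Year Built"],  # restore the datetime dtype
    keep_default_na=False,       # keep the "None" label as a string
)
print(reloaded.dtypes)
```

Note that `keep_default_na=False` also stops empty fields from being read as NaN, so it only suits files that are already fully imputed.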


In [22]:
# Final safety check: convert all remaining NaN to safe defaults

# Categorical columns: fill missing with 'None'
cat_cols = df_clean.select_dtypes(include=["object", "string"]).columns
df_clean[cat_cols] = df_clean[cat_cols].fillna("None")

# Numeric columns: fill missing with median just to be safe
num_cols = df_clean.select_dtypes(include=["int64", "float64"]).columns
for col in num_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Temporal columns (datetime): fill with a valid dummy date
date_cols = df_clean.select_dtypes(include=["datetime64[ns]"]).columns
for col in date_cols:
    df_clean[col] = df_clean[col].fillna(pd.to_datetime("1900-01-01"))

# Save again
df_clean.to_csv("data/cleaned/ames_cleaned.csv", index=False)

print("Final cleaning safety applied.")
print("Remaining missing:", df_clean.isna().sum().sum())

Final cleaning safety applied.
Remaining missing: 0
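
A blanket fill hides which columns still had gaps. A minimal sketch of a per-column report that could run before any catch-all fill, so that unexpected gaps are noticed and not silently replaced; `missing_report` is a hypothetical helper and the frame is toy data.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, with counts and percentages."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(3),
    })
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame with gaps in two of its three columns
toy = pd.DataFrame({
    "Lot Frontage": [141.0, np.nan, 81.0, np.nan],
    "Alley": [None, None, None, "Pave"],
    "SalePrice": [215000, 105000, 172000, 244000],
})

report = missing_report(toy)
print(report)
```

An empty report confirms that the earlier, column-specific steps already handled every gap.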

