1. Break MSSubClass into several features
    - number of stories
    - age of home (pre_1945)
    - number of floors (overlaps with "HouseStyle")
    - if the house is located in a development or not (i.e., does it have an HOA?)
    - if it is a single-family home

2. Pseudo-quantifiable to quantifiable
    - Street
    - Alley
    - LandSlope
    - Condition1 and Condition2
        - Consider creating some sort of "Noise" variable?
            - +6 for RR adjacent
            - +4 for RR near
            - +2 for feeder
            - +4 for artery
        - Consider creating an "ambiance" feature?
            - +1 for nearby positive
            - +3 for adjacent positive
    - BsmtQual -> basement-height
        - Ex = 105
        - Gd = 95
        - TA = 85
        - Fa = 75
        - Po = 65
        - NA = 0
    - BsmtCond -> basement-condition
    - BsmtExposure -> basement-exposure
    - KitchenQual -> kitchen-quality
    - Functional -> overall-functionality
    - FireplaceQu -> fireplace-quality
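
The "Noise" idea for Condition1 and Condition2 could be sketched roughly as follows. The weights are the ones listed in the notes above. The helper name `noise_score`, the decision to sum the two condition columns, and the use of the raw Ames codes (`RRAn`, `RRAe`, `RRNn`, `RRNe`, `Feedr`, `Artery`) as keys are all assumptions made for illustration.

```python
import pandas as pd

# Weights from the notes above; the keys are assumed to be the raw Ames condition codes.
NOISE_WEIGHTS = {
    "RRAn": 6, "RRAe": 6,   # adjacent to a railroad
    "RRNn": 4, "RRNe": 4,   # near a railroad
    "Feedr": 2,             # feeder street
    "Artery": 4,            # arterial street
}

def noise_score(df, columns=("Condition1", "Condition2")):
    """Sum the noise weight of each condition column; unlisted codes count as 0."""
    score = pd.Series(0, index=df.index)
    for col in columns:
        score = score + df[col].map(NOISE_WEIGHTS).fillna(0).astype(int)
    return score

# Small illustrative frame (not real data).
demo = pd.DataFrame({
    "Condition1": ["Norm", "RRAn", "Feedr"],
    "Condition2": ["Norm", "Artery", "Norm"],
})
demo["noise"] = noise_score(demo)
```

An "ambiance" score could follow the same pattern with its own weight dictionary.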


3. Create dummies
    - Utilities (Electricity, sewer, water, gas)
    - Exterior Finish (Exterior1st, Exterior2nd)
        - exterior-brick (BrkComm, BrkFace)
        - exterior-stone (Stone)
        - exterior-asbestos (AsbShng)
        - exterior-asphalt (AsphShn)
        - exterior-unfinished (CBlock, CemntBd, HdBoard, Plywood, PreCast)
        - exterior-wood (Wd Sdng, WdShing)
        - exterior-plaster (ImStucc, Stucco)
        - exterior-plastic (VinylSd)
        - exterior-metal (MetalSd)
        - exterior-other (Other)
        - exterior-durable
            - T = OR(exterior-brick, exterior-asbestos, exterior-asphalt, exterior-plastic, exterior-metal)
    - Foundation
        - foundation-modern (CBlock, PConc, Slab)
    - Basement
        - basement-present (T/F)
        - basement-bedroom (GLQ, ALQ, BLQ)
        - basement-recreation (Rec)
        - basement-unfinished (LwQ, Unf)
    - Heating
        - **Do research on these types of furnaces to determine what the modern ones are**
    - CentralAir -> air-conditioning (T/F)
    - Garages
        - garage-present
            - F = CarPort, NA
        - garage-attached (2Types, Attchd, Basment, BuiltIn)
        - garage



4. Flag remodelling 
    - if YearBuilt != YearRemodAdd
    - years since remodel?
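
A minimal sketch of the remodel flag and a "years since remodel" feature is shown below. The column names `YearBuilt`, `YearRemodAdd` and `YrSold` are assumed to be the raw Ames names (the cleaned files may name them differently), and `add_remodel_features` is a hypothetical helper.

```python
import pandas as pd

def add_remodel_features(df, built="YearBuilt", remod="YearRemodAdd", sold="YrSold"):
    """Return a copy of df with a remodel flag and the years elapsed since the remodel."""
    out = df.copy()
    # A home counts as remodeled when the remodel year differs from the build year.
    out["remodeled"] = out[built] != out[remod]
    # Clip at 0 in case a remodel is recorded after the sale year.
    out["years-since-remodel"] = (out[sold] - out[remod]).clip(lower=0)
    return out

# Small illustrative frame (not real data).
demo = pd.DataFrame({
    "YearBuilt": [1960, 2005],
    "YearRemodAdd": [1995, 2005],
    "YrSold": [2008, 2007],
})
demo = add_remodel_features(demo)
```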
   

In [47]:
import os
import sys

PROJECT_ROOT = \
    os.path.abspath(os.path.join(
        os.path.dirname(""),
        os.pardir))

sys.path.append(PROJECT_ROOT)

In [48]:
import pandas as pd
import numpy as np

In [49]:
train = pd.read_csv(PROJECT_ROOT + "/data/interim/train-cleaned.csv")
test = pd.read_csv(PROJECT_ROOT + "/data/interim/test-cleaned.csv")

In [50]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 82 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   Unnamed: 0                             1460 non-null   int64  
 1   id                                     1460 non-null   int64  
 2   ms-subclass                            1460 non-null   object 
 3   ms-zoning                              1460 non-null   object 
 4   lot-frontage                           1460 non-null   int64  
 5   lot-area                               1460 non-null   int64  
 6   roads-street-material                  1460 non-null   object 
 7   roads-alley-material                   1460 non-null   object 
 8   lot-shape                              1460 non-null   object 
 9   lot-flatness                           1460 non-null   object 
 10  utilities-city-provided                1460 non-null   object 
 11  lot-

Columns that contain strings need to be converted or dropped, since most ML techniques require numeric input.

In [51]:
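def print_string_columns():
    """Print every column whose first-row value is a string."""
    print("The following columns contain string values:")

    str_cols = []

    for column in train.columns:
        # Only row 0 is inspected, so a string column whose first value is NaN is skipped.
        if isinstance(train.loc[0, column], str):
            str_cols.append(column)
            print(f"({len(str_cols)}) {column}")

print_string_columns()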

The following columns contain string values:
(1) ms-subclass
(2) ms-zoning
(3) roads-street-material
(4) roads-alley-material
(5) lot-shape
(6) lot-flatness
(7) utilities-city-provided
(8) lot-access
(9) lot-slope
(10) location-neighborhood
(11) location-feature-1
(12) location-feature-2
(13) general-dwelling-type
(14) construction-style
(15) construction-roof-style
(16) construction-roof-material
(17) construction-exterior-1
(18) construction-exterior-2
(19) construction-masonry-type
(20) ratings-exterior-quality
(21) ratings-exterior-condition
(22) construction-foundation-type
(23) ratings-basement-quality
(24) ratings-basement-condition
(25) construction-basement-access
(26) construction-basement-finish-1
(27) construction-basement-finish-2
(28) utilities-heating-type
(29) ratings-heating-combined
(30) utilities-central-air-conditioning
(31) utilities-electrical-wiring-type
(32) ratings-kitchen-quality
(33) ratings-home-functionality
(34) garage-location
(35) garage-finish
(36) rati
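
Because the check above looks only at the first row, a string column that starts with a missing value would not be listed. One possible alternative, sketched here with a hypothetical helper `string_columns_of`, scans every non-null value in each column instead.

```python
import numpy as np
import pandas as pd

def string_columns_of(df):
    """Return the names of columns that hold at least one non-null string value."""
    return [
        col for col in df.columns
        if df[col].dropna().map(lambda value: isinstance(value, str)).any()
    ]

# Small illustrative frame (not real data).
demo = pd.DataFrame({
    "lot-area": [8450, 9600],
    "lot-shape": ["Reg", "IR1"],
    "lot-fence-material": [np.nan, "MnPrv"],  # first row is NaN
})
found = string_columns_of(demo)
```

On the real data this may report more columns than the first-row check does.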

In [52]:
train["ms-subclass"].value_counts()

ms-subclass
1-STORY 1946 & NEWER ALL STYLES                          536
2-STORY 1946 & NEWER                                     299
1-1/2 STORY FINISHED ALL AGES                            144
1-STORY PUD (Planned Unit Development) - 1946 & NEWER     87
1-STORY 1945 & OLDER                                      69
2-STORY PUD - 1946 & NEWER                                63
2-STORY 1945 & OLDER                                      60
SPLIT OR MULTI-LEVEL                                      58
DUPLEX - ALL STYLES AND AGES                              52
2 FAMILY CONVERSION - ALL STYLES AND AGES                 30
SPLIT FOYER                                               20
2-1/2 STORY ALL AGES                                      16
1-1/2 STORY - UNFINISHED ALL AGES                         12
PUD - MULTILEVEL - INCL SPLIT LEV/FOYER                   10
1-STORY W/FINISHED ATTIC ALL AGES                          4
Name: count, dtype: int64

There are a few different things that can be extracted from the ms-subclass column:
* Number of stories
* Year built
* If it is part of a planned unit development (PUD or subdivision)
* If it is a multi-family home (duplex)

Of these, only the following are not already captured elsewhere in the dataframe:
* Number of stories
* PUD status
* Multi-family status

In [53]:
STORY_MAP = {
    "1-STORY 1946 & NEWER ALL STYLES": 1,
    "2-STORY 1946 & NEWER": 2,
    "1-1/2 STORY FINISHED ALL AGES": 1.5,
    "1-STORY PUD (Planned Unit Development) - 1946 & NEWER": 1,
    "1-STORY 1945 & OLDER": 1,
    "2-STORY PUD - 1946 & NEWER": 2,
    "2-STORY 1945 & OLDER": 2,
    "SPLIT OR MULTI-LEVEL": 1.5,
    "DUPLEX - ALL STYLES AND AGES": 2,
    "2 FAMILY CONVERSION - ALL STYLES AND AGES": 2,
    "SPLIT FOYER": 1.5,
    "2-1/2 STORY ALL AGES": 2.5,
    "1-1/2 STORY - UNFINISHED ALL AGES": 1.5,
    "PUD - MULTILEVEL - INCL SPLIT LEV/FOYER": 1.5,
    "1-STORY W/FINISHED ATTIC ALL AGES": 1.5}

REGION_MAP =  \
    {'bloomington-heights':'location-far-north',  
     'bluestem':'location-southwest',
     'briardale':'location-near-north',
     'brookside':'location-downtown',
     'clear-creek':'location-near-west',
     'college-creek':'location-far-west',
     'crawford':'location-southwest',
     'edwards':'location-near-west',
     'gilberts':'location-far-north',
     'greens':'location-northwest',
     'green-hill':'location-far-southwest',
     'iowa-dot-rail-road':'location-downtown',
     'landmark':'location-far-west',
     'meadow-village':'location-south',
     'mitchell':'location-south',
     'north-ames':'location-near-north',
     'northpart-villa':'location-near-north',
     'northwest-ames':'location-near-north',
     'northridge':'location-northwest',
     'northridge-heights': 'location-northwest',
     'old-town':'location-downtown',
     'south-and-west-iowa-state':'location-southwest',
     'sawyer':'location-near-west',
     'sawyer-west':'location-far-west',
     'somerset':'location-northwest',
     'stone-brook':'location-far-north',
     'timberland':'location-far-southwest',
     'veenker':'location-near-west'}

ZONING_MAP = {
    "residential-low-density": 1,
    "residential-medium-density": 2,
    "residential-floating-village": 1,
    "residential-high-density": 3}

# Ordinal scale from worst to best: poor < fair < typical/average < good < excellent
QUALITY_MAP = {
    "Po": 0,
    "Fa": 1,
    "TA": 2,
    "Gd": 3,
    "Ex": 4}

FUNCTIONALITY_MAP = {
    "Sev": 0,
    "Maj2": 1,
    "Maj1": 2,
    "Mod": 3,
    "Min2": 4,
    "Min1": 5,
    "Typ": 6}

EXPOSURE_MAP = {
    "No": 0,
    "Mn": 1,
    "Av": 2,
    "Gd": 3}

FENCE_MAP = {
    np.nan: 0,
    "MnWw": 1,
    "GdWo": 2,
    "MnPrv": 3,
    "GdPrv": 4}

In [54]:
string_columns = []
unused_columns = []

In [55]:
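# Derive PUD membership and the story count from the ms-subclass label.
# Note: "multi-family" is recomputed from general-dwelling-type further down,
# which replaces the DUPLEX-based flag set here.
train["pud"] = train["ms-subclass"].str.contains("PUD")
train["multi-family"] = train["ms-subclass"].str.contains("DUPLEX")
train["stories"] = train["ms-subclass"].map(STORY_MAP)
test["pud"] = test["ms-subclass"].str.contains("PUD")
test["multi-family"] = test["ms-subclass"].str.contains("DUPLEX")
test["stories"] = test["ms-subclass"].map(STORY_MAP)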

string_columns.append("ms-subclass")

# train = train.drop("ms-subclass", axis="columns")
# test = test.drop("ms-subclass", axis="columns")
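
`Series.map` returns NaN for any category that is missing from the dictionary, so a label that appears only in the test set would silently become a missing story count. A small check along these lines could catch that; `unmapped_values` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def unmapped_values(series, mapping):
    """Return the distinct non-null values of `series` that have no key in `mapping`."""
    values = series.dropna().unique()
    return sorted(v for v in values if v not in mapping)

# Small illustrative example (not real data).
demo = pd.Series(["SPLIT FOYER", "2-STORY 1945 & OLDER", "SPLIT FOYER"])
demo_map = {"SPLIT FOYER": 1.5}
missing = unmapped_values(demo, demo_map)
```

In the notebook this would be called as `unmapped_values(test["ms-subclass"], STORY_MAP)`, with an empty list meaning every label is covered.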

In [56]:
train["zoning-density"] = train["ms-zoning"].map(ZONING_MAP)
test["zoning-density"] = test["ms-zoning"].map(ZONING_MAP)

string_columns.append("ms-zoning")

# train = train.drop("ms-zoning", axis="columns")
# test = test.drop("ms-zoning", axis="columns")

In [57]:
train["roads-paved"] = train["roads-street-material"].str.contains("paved")
test["roads-paved"] = test["roads-street-material"].str.contains("paved")

string_columns.append("roads-street-material")

# train = train.drop("roads-street-material", axis="columns")
# test = test.drop("roads-street-material", axis="columns")

In [58]:
train["has-alley"] = ~train["roads-alley-material"].str.contains("none")
test["has-alley"] = ~test["roads-alley-material"].str.contains("none")

string_columns.append("roads-alley-material")

# train = train.drop("roads-alley-material", axis="columns")
# test = test.drop("roads-alley-material", axis="columns")

In [59]:
train["all-city-utilities"] = train["utilities-city-provided"].str.contains("electricity-gas-water-sewer")
test["all-city-utilities"] = test["utilities-city-provided"].str.contains("electricity-gas-water-sewer")

string_columns.append("utilities-city-provided")

# train = train.drop("utilities-city-provided", axis="columns")
# test = test.drop("utilities-city-provided", axis="columns")

The location features describe what lies near the lot (railroads, busy streets, parks). To make this usable by an ML model, I split the information into several boolean columns:
* "adjacent-railroad"
* "adjacent-traffic"
* "adjacent-park"
* "near-railroad"
* "near-park"

In [60]:
train["adjacent-railroad"] = \
    train["location-feature-1"].str.contains("RRAn") \
        | train["location-feature-1"].str.contains("RRAe") \
        | train["location-feature-2"].str.contains("RRAn") \
        | train["location-feature-2"].str.contains("RRAe")
train["near-railroad"] = \
    train["location-feature-1"].str.contains("RRNn") \
        | train["location-feature-1"].str.contains("RRNe") \
        | train["location-feature-2"].str.contains("RRNn") \
        | train["location-feature-2"].str.contains("RRNe") \
        | train["adjacent-railroad"]
train["adjacent-traffic"] = \
    train["location-feature-1"].str.contains("Artery") \
        | train["location-feature-1"].str.contains("Feedr") \
        | train["location-feature-2"].str.contains("Artery") \
        | train["location-feature-2"].str.contains("Feedr")
train["adjacent-park"] = \
    train["location-feature-1"].str.contains("PosA") \
        | train["location-feature-2"].str.contains("PosA")
train["near-park"] = \
    train["location-feature-1"].str.contains("PosN") \
        | train["location-feature-2"].str.contains("PosN") \
        | train["adjacent-park"]
test["adjacent-railroad"] = \
    test["location-feature-1"].str.contains("RRAn") \
        | test["location-feature-1"].str.contains("RRAe") \
        | test["location-feature-2"].str.contains("RRAn") \
        | test["location-feature-2"].str.contains("RRAe")
test["near-railroad"] = \
    test["location-feature-1"].str.contains("RRNn") \
        | test["location-feature-1"].str.contains("RRNe") \
        | test["location-feature-2"].str.contains("RRNn") \
        | test["location-feature-2"].str.contains("RRNe") \
        | test["adjacent-railroad"]
test["adjacent-traffic"] = \
    test["location-feature-1"].str.contains("Artery") \
        | test["location-feature-1"].str.contains("Feedr") \
        | test["location-feature-2"].str.contains("Artery") \
        | test["location-feature-2"].str.contains("Feedr")
test["adjacent-park"] = \
    test["location-feature-1"].str.contains("PosA") \
        | test["location-feature-2"].str.contains("PosA")
test["near-park"] = \
    test["location-feature-1"].str.contains("PosN") \
        | test["location-feature-2"].str.contains("PosN") \
        | test["adjacent-park"]

string_columns.append("location-feature-1")
string_columns.append("location-feature-2")

# train = train.drop(["location-feature-1", "location-feature-2"], axis="columns")
# test = test.drop(["location-feature-1", "location-feature-2"], axis="columns")
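
The chained `str.contains` calls above repeat the same pattern for each code, each column and each dataframe. As a sketch of how that could be condensed, the hypothetical helper `contains_any` below builds one regular expression from the list of codes and ORs the result across columns. It keeps the substring semantics of `str.contains` and treats missing values as False, which is an assumption about the desired behavior.

```python
import re
import pandas as pd

def contains_any(df, columns, codes):
    """True for rows where any of `columns` contains any of `codes` as a substring."""
    pattern = "|".join(re.escape(code) for code in codes)
    result = pd.Series(False, index=df.index)
    for col in columns:
        result = result | df[col].str.contains(pattern, regex=True, na=False)
    return result

# Small illustrative frame (not real data).
demo = pd.DataFrame({
    "location-feature-1": ["Norm", "RRAn", "PosN", "Feedr"],
    "location-feature-2": ["Norm", "Norm", "PosA", "RRNe"],
})
cols = ["location-feature-1", "location-feature-2"]
demo["adjacent-railroad"] = contains_any(demo, cols, ["RRAn", "RRAe"])
demo["near-railroad"] = contains_any(demo, cols, ["RRNn", "RRNe"]) | demo["adjacent-railroad"]
```

Wrapping the feature definitions in a function that takes a dataframe would also allow the same code to run on `train` and `test` without duplication.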

In [61]:
train["multi-family"] = ~train["general-dwelling-type"].str.contains("1Fam")
test["multi-family"] = ~test["general-dwelling-type"].str.contains("1Fam")

string_columns.append("general-dwelling-type")

# train = train.drop("general-dwelling-type", axis="columns")
# test = test.drop("general-dwelling-type", axis="columns")

Construction style contains no information beyond what was already extracted from ms-subclass, so we can drop it.

In [62]:
unused_columns.append("construction-style")

# train = train.drop("construction-style", axis="columns")
# test = test.drop("construction-style", axis="columns")

In [63]:
train["siding-unfinished"] = \
    train["construction-exterior-1"].str.contains("CBlock") \
    | train["construction-exterior-1"].str.contains("Other") \
    | train["construction-exterior-1"].str.contains("PreCast") \
    | train["construction-exterior-1"].str.contains("Plywood") \
    | train["construction-exterior-2"].str.contains("CBlock") \
    | train["construction-exterior-2"].str.contains("Other") \
    | train["construction-exterior-2"].str.contains("PreCast") \
    | train["construction-exterior-2"].str.contains("Plywood")

train["siding-brick"] = \
    train["construction-exterior-1"].str.contains("BrkComm") \
    | train["construction-exterior-1"].str.contains("BrkFace") \
    | train["construction-exterior-2"].str.contains("BrkComm") \
    | train["construction-exterior-2"].str.contains("BrkFace")

train["siding-natural"] = \
    train["construction-exterior-1"].str.contains("Stone") \
    | train["construction-exterior-1"].str.contains("Stucco") \
    | train["construction-exterior-1"].str.contains("Wd Sdng") \
    | train["construction-exterior-1"].str.contains("WdShing") \
    | train["construction-exterior-2"].str.contains("Stone") \
    | train["construction-exterior-2"].str.contains("Stucco") \
    | train["construction-exterior-2"].str.contains("Wd Sdng") \
    | train["construction-exterior-2"].str.contains("WdShing")

train["siding-durable"] = \
    train["construction-exterior-1"].str.contains("CemntBd") \
    | train["construction-exterior-1"].str.contains("HdBoard") \
    | train["construction-exterior-1"].str.contains("ImStucc") \
    | train["construction-exterior-1"].str.contains("MetalSd") \
    | train["construction-exterior-1"].str.contains("VinylSd") \
    | train["construction-exterior-2"].str.contains("CemntBd") \
    | train["construction-exterior-2"].str.contains("HdBoard") \
    | train["construction-exterior-2"].str.contains("ImStucc") \
    | train["construction-exterior-2"].str.contains("MetalSd") \
    | train["construction-exterior-2"].str.contains("VinylSd")


test["siding-unfinished"] = \
    test["construction-exterior-1"].str.contains("CBlock") \
    | test["construction-exterior-1"].str.contains("Other") \
    | test["construction-exterior-1"].str.contains("PreCast") \
    | test["construction-exterior-1"].str.contains("Plywood") \
    | test["construction-exterior-2"].str.contains("CBlock") \
    | test["construction-exterior-2"].str.contains("Other") \
    | test["construction-exterior-2"].str.contains("PreCast") \
    | test["construction-exterior-2"].str.contains("Plywood")

test["siding-brick"] = \
    test["construction-exterior-1"].str.contains("BrkComm") \
    | test["construction-exterior-1"].str.contains("BrkFace") \
    | test["construction-exterior-2"].str.contains("BrkComm") \
    | test["construction-exterior-2"].str.contains("BrkFace")

test["siding-natural"] = \
    test["construction-exterior-1"].str.contains("Stone") \
    | test["construction-exterior-1"].str.contains("Stucco") \
    | test["construction-exterior-1"].str.contains("Wd Sdng") \
    | test["construction-exterior-1"].str.contains("WdShing") \
    | test["construction-exterior-2"].str.contains("Stone") \
    | test["construction-exterior-2"].str.contains("Stucco") \
    | test["construction-exterior-2"].str.contains("Wd Sdng") \
    | test["construction-exterior-2"].str.contains("WdShing")

test["siding-durable"] = \
    test["construction-exterior-1"].str.contains("CemntBd") \
    | test["construction-exterior-1"].str.contains("HdBoard") \
    | test["construction-exterior-1"].str.contains("ImStucc") \
    | test["construction-exterior-1"].str.contains("MetalSd") \
    | test["construction-exterior-1"].str.contains("VinylSd") \
    | test["construction-exterior-2"].str.contains("CemntBd") \
    | test["construction-exterior-2"].str.contains("HdBoard") \
    | test["construction-exterior-2"].str.contains("ImStucc") \
    | test["construction-exterior-2"].str.contains("MetalSd") \
    | test["construction-exterior-2"].str.contains("VinylSd")

string_columns.append("construction-exterior-1")
string_columns.append("construction-exterior-2")

# train = train.drop(["construction-exterior-1", "construction-exterior-2"], axis="columns")
# test = test.drop(["construction-exterior-1", "construction-exterior-2"], axis="columns")

In [64]:
train["roof-modern"] = \
    train["construction-roof-material"].str.contains("CompShg")
test["roof-modern"] = \
    test["construction-roof-material"].str.contains("CompShg")

string_columns.append("construction-roof-material")

# train = train.drop("construction-roof-material", axis="columns")
# test = test.drop("construction-roof-material", axis="columns")

In [65]:
# train["exterior-condition"] = \
#     train["ratings-exterior-condition"].map(QUALITY_MAP)
# test["exterior-condition"] = \
#     test["ratings-exterior-condition"].map(QUALITY_MAP)

# string_columns.append("ratings-exterior-condition")

In [66]:
# train["exterior-quality"] = \
#     train["ratings-exterior-quality"].map(QUALITY_MAP)
# test["exterior-quality"] = \
#     test["ratings-exterior-quality"].map(QUALITY_MAP)

# string_columns.append("ratings-exterior-quality")

Fence material information:
* fence-quality

In [67]:
train["fence-quality"] = train["lot-fence-material"].map(FENCE_MAP)
test["fence-quality"] = test["lot-fence-material"].map(FENCE_MAP)

string_columns.append("lot-fence-material")

Lot misc feature
* shed
* tennis-court
* second-garage

In [68]:
train["shed"] = train["lot-misc-feature"].str.contains("Shed")
train["tennis-court"] = train["lot-misc-feature"].str.contains("TenC")
train["second-garage"] = train["lot-misc-feature"].str.contains("Gar2")

test["shed"] = test["lot-misc-feature"].str.contains("Shed")
test["tennis-court"] = test["lot-misc-feature"].str.contains("TenC")
test["second-garage"] = test["lot-misc-feature"].str.contains("Gar2")

string_columns.append("lot-misc-feature")

In [69]:
ratings_columns = \
    ["exterior-condition", "exterior-quality",
     "basement-quality", "basement-condition", 
     "heating-combined", "kitchen-quality", 
     "garage-quality", "garage-condition",
     "fireplace-quality", "pool-combined"]

for col in ratings_columns:
    train[col] = \
        train["ratings-" + col].map(QUALITY_MAP)
    test[col] = \
        test["ratings-" + col].map(QUALITY_MAP)

    string_columns.append("ratings-" + col)    


In [70]:
train["home-functionality"] = \
    train["ratings-home-functionality"].map(FUNCTIONALITY_MAP)

test["home-functionality"] = \
    test["ratings-home-functionality"].map(FUNCTIONALITY_MAP)

string_columns.append("ratings-home-functionality")

In [71]:
# train = train.drop(["construction-roof-style", "construction-masonry-type"], axis="columns")
# test = test.drop(["construction-roof-style", "construction-masonry-type"], axis="columns")
unused_columns.append("construction-roof-style")
unused_columns.append("construction-masonry-type")

Basement finish information:
* basement-unfinished
* basement-bedroom
* basement-rec-room

In [72]:
train["basement-unfinished"] = \
    train["construction-basement-finish-1"].str.contains("Unf") \
    | train["construction-basement-finish-2"].str.contains("Unf")

train["basement-bedroom"] = \
    train["construction-basement-finish-1"].str.contains("LQ") \
    | train["construction-basement-finish-1"].str.contains("LwQ") \
    | train["construction-basement-finish-2"].str.contains("LQ") \
    | train["construction-basement-finish-2"].str.contains("LwQ")

train["basement-rec-room"] = \
    train["construction-basement-finish-1"].str.contains("Rec") \
    | train["construction-basement-finish-2"].str.contains("Rec")


test["basement-unfinished"] = \
    test["construction-basement-finish-1"].str.contains("Unf") \
    | test["construction-basement-finish-2"].str.contains("Unf")

test["basement-bedroom"] = \
    test["construction-basement-finish-1"].str.contains("LQ") \
    | test["construction-basement-finish-1"].str.contains("LwQ") \
    | test["construction-basement-finish-2"].str.contains("LQ") \
    | test["construction-basement-finish-2"].str.contains("LwQ")

test["basement-rec-room"] = \
    test["construction-basement-finish-1"].str.contains("Rec") \
    | test["construction-basement-finish-2"].str.contains("Rec")

string_columns.append("construction-basement-finish-1")
string_columns.append("construction-basement-finish-2")

# train = \
#     train.drop(["construction-basement-finish-1", "construction-basement-finish-2"],
#                axis="columns")

# test = \
#     test.drop(["construction-basement-finish-1", "construction-basement-finish-2"],
#                axis="columns")

Foundation type information:
* foundation-modern

In [73]:
train["foundation-modern"] = \
    train["construction-foundation-type"].str.contains("PConc") \
    | train["construction-foundation-type"].str.contains("CBlock")

test["foundation-modern"] = \
    test["construction-foundation-type"].str.contains("PConc") \
    | test["construction-foundation-type"].str.contains("CBlock")

string_columns.append("construction-foundation-type")

# train = train.drop("construction-foundation-type", axis="columns")
# test = test.drop("construction-foundation-type", axis="columns")

Heating type information:
* heating-gas

In [74]:
train["heating-gas"] = \
    train["utilities-heating-type"].str.contains("Gas")
test["heating-gas"] = \
    test["utilities-heating-type"].str.contains("Gas")

string_columns.append("utilities-heating-type")

# train = train.drop("utilities-heating-type", axis="columns")
# test = test.drop("utilities-heating-type", axis="columns")

In [75]:
train["central-air-conditioning"] = \
    train["utilities-central-air-conditioning"].str.contains("Y")
test["central-air-conditioning"] = \
    test["utilities-central-air-conditioning"].str.contains("Y")

string_columns.append("utilities-central-air-conditioning")

# train = train.drop("utilities-central-air-conditioning", axis="columns")
# test = test.drop("utilities-central-air-conditioning", axis="columns")

In [76]:
train["electric-modern"] = \
    train["utilities-electrical-wiring-type"].str.contains("SBrkr")
test["electric-modern"] = \
    test["utilities-electrical-wiring-type"].str.contains("SBrkr")

string_columns.append("utilities-electrical-wiring-type")

# train = train.drop("utilities-electrical-wiring-type", axis="columns")
# test = test.drop("utilities-electrical-wiring-type", axis="columns")

In [77]:
train["garage-dettached"] = \
    train["garage-location"].str.contains("Detchd")
test["garage-dettached"] = \
    test["garage-location"].str.contains("Detchd")

string_columns.append("garage-location")

# train = train.drop("garage-location", axis="columns")
# test = test.drop("garage-location", axis="columns")

In [78]:
train["garage-finished"] = \
    train["garage-finish"].str.contains("Fin")
test["garage-finished"] = \
    test["garage-finish"].str.contains("Fin")

string_columns.append("garage-finish")

# train = train.drop("garage-finish", axis="columns")
# test = test.drop("garage-finish", axis="columns")

In [79]:
train["driveway-paved"] = \
    train["roads-driveway-material"].str.contains("Y")
test["driveway-paved"] = \
    test["roads-driveway-material"].str.contains("Y")

string_columns.append("roads-driveway-material")

# train = train.drop("roads-driveway-material", axis="columns")
# test = test.drop("roads-driveway-material", axis="columns")

Sale type information:
* new-construction
* cash-sale

In [80]:
train["new-construction"] = \
    train["general-sale-type"].str.contains("New")
test["new-construction"] = \
    test["general-sale-type"].str.contains("New")

train["cash-sale"] = \
    train["general-sale-type"].str.contains("CWD")
test["cash-sale"] = \
    test["general-sale-type"].str.contains("CWD")

string_columns.append("general-sale-type")

# train = train.drop("general-sale-type", axis="columns")
# test = test.drop("general-sale-type", axis="columns")

Sale condition information:
* normal-sale
* family-sale

In [81]:
train["normal-sale"] = \
    train["general-sale-condition"].str.contains("Normal")
train["family-sale"] = \
    train["general-sale-condition"].str.contains("Family")

test["normal-sale"] = \
    test["general-sale-condition"].str.contains("Normal")
test["family-sale"] = \
    test["general-sale-condition"].str.contains("Family")

string_columns.append("general-sale-condition")

In [82]:
train["basement-access"] = \
    train["construction-basement-access"].map(EXPOSURE_MAP)
test["basement-access"] = \
    test["construction-basement-access"].map(EXPOSURE_MAP)

string_columns.append("construction-basement-access")

In [83]:
train["location-region"] = train["location-neighborhood"].map(REGION_MAP)
test["location-region"] = test["location-neighborhood"].map(REGION_MAP)

In [84]:
train_region_dummies = pd.get_dummies(train["location-region"])
train_neighborhood_dummies = pd.get_dummies(train["location-neighborhood"])

test_region_dummies = pd.get_dummies(test["location-region"])
test_neighborhood_dummies = pd.get_dummies(test["location-neighborhood"])

train = pd.concat([train, train_region_dummies, train_neighborhood_dummies], axis="columns")
test = pd.concat([test, test_region_dummies, test_neighborhood_dummies], axis="columns")

string_columns.append("location-region")
string_columns.append("location-neighborhood")
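
Calling `pd.get_dummies` separately on train and test yields one column per category present in each frame, so the two frames can end up with different columns if a neighborhood or region appears in only one of them. One possible way to guard against this, sketched on toy data, is to reindex the test dummies to the training columns.

```python
import pandas as pd

# Small illustrative frames (not real data).
train_demo = pd.DataFrame(
    {"location-region": ["location-downtown", "location-south", "location-downtown"]})
test_demo = pd.DataFrame(
    {"location-region": ["location-south", "location-far-north"]})

train_dummies = pd.get_dummies(train_demo["location-region"])
test_dummies = pd.get_dummies(test_demo["location-region"])

# Keep exactly the training columns: categories absent from test are filled with 0,
# and categories unseen in training are dropped.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

Dropping unseen categories is a design choice; a model fitted on the training columns could not use them anyway.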

In [85]:
lot_cols = ["lot-shape", "lot-flatness", "lot-access", "lot-slope"]
for col in lot_cols:
    string_columns.append(col) 

In [86]:
string_columns.append("id")
unused_columns.append("id")

train_strings = train[string_columns]
train_unused = train[unused_columns]

test_strings = test[string_columns]
test_unused = test[unused_columns]

string_columns.remove("id")
unused_columns.remove("id")

train = train \
    .drop(string_columns, axis="columns") \
    .drop(unused_columns, axis="columns")

test = test \
    .drop(string_columns, axis="columns") \
    .drop(unused_columns, axis="columns")

In [87]:
print("The following columns contain NaN values:")
for col in train.columns:
    if train[col].isna().sum() > 0:
        print(f"{col} contains {train[col].isna().sum()} nan values")

The following columns contain NaN values:
garage-build-year contains 81 nan values
shed contains 1406 nan values
tennis-court contains 1406 nan values
second-garage contains 1406 nan values
basement-quality contains 37 nan values
basement-condition contains 37 nan values
garage-quality contains 81 nan values
garage-condition contains 81 nan values
fireplace-quality contains 690 nan values
pool-combined contains 1453 nan values
garage-dettached contains 81 nan values
garage-finished contains 81 nan values
basement-access contains 38 nan values


### Managing NaNs
Fill NaN with False
* tennis-court
* shed
* second-garage
* garage-finished

Check if NaN should be filled with True or False
* garage-dettached

Fill NaN with mean
* basement-quality
* basement-condition
* garage-quality
* garage-condition
* fireplace-quality
* pool-combined
* basement-access
* garage-build-year

In [88]:
fill_false = ["tennis-court", "shed", "second-garage", "garage-finished", "garage-dettached"]
for col in fill_false:
    train[col] = train[col].fillna(value=False)
    test[col] = test[col].fillna(value=False)

fill_mean = ["basement-quality", "basement-condition", "garage-quality", "garage-condition", "fireplace-quality", "pool-combined", "basement-access", "garage-build-year"]
for col in fill_mean:
    train[col] = train[col].fillna(
        value=train[col].mean())
    test[col] = test[col].fillna(
        value=test[col].mean())
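
The cell above fills the test set with means computed from the test set itself. A common alternative is to reuse the training means for both frames, so that the test set is transformed exactly as the training set was. The sketch below illustrates that variant with a hypothetical helper `fill_with_train_mean`; it is not what the notebook currently does.

```python
import numpy as np
import pandas as pd

def fill_with_train_mean(train_df, test_df, columns):
    """Fill NaNs in both frames using means computed on the training frame only."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in columns:
        mean = train_df[col].mean()  # NaNs are skipped by default
        train_out[col] = train_out[col].fillna(mean)
        test_out[col] = test_out[col].fillna(mean)
    return train_out, test_out

# Small illustrative frames (not real data).
train_demo = pd.DataFrame({"garage-quality": [2.0, 4.0, np.nan]})
test_demo = pd.DataFrame({"garage-quality": [np.nan, 1.0]})
train_filled, test_filled = fill_with_train_mean(train_demo, test_demo, ["garage-quality"])
```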

In [89]:
print("The following columns contain NaN values:")
for col in train.columns:
    if train[col].isna().sum() > 0:
        print(f"{col} contains {train[col].isna().sum()} nan values")

The following columns contain NaN values:


In [90]:
train = train.drop("Unnamed: 0", axis="columns")
test = test.drop("Unnamed: 0", axis="columns")

In [91]:
train.to_csv(PROJECT_ROOT+"/data/processed/primary-engineering/train-engineered.csv")
train_strings.to_csv(PROJECT_ROOT+"/data/processed/primary-engineering/residual/train-string-columns.csv")
train_unused.to_csv(PROJECT_ROOT+"/data/processed/primary-engineering/residual/train-unused-columns.csv")

test.to_csv(PROJECT_ROOT+"/data/processed/primary-engineering/test-engineered.csv")
test_strings.to_csv(PROJECT_ROOT+"/data/processed/primary-engineering/residual/test-string-columns.csv")
test_unused.to_csv(PROJECT_ROOT+"/data/processed/primary-engineering/residual/test-unused-columns.csv")