# House Prices Challenge

## Data fields
Here's a brief version of what you'll find in the data description file.

|Column|Description|Column|Description|Column|Description|Column|Description|
|------|------|------|------|------|------|------|------|
|SalePrice | The property's sale price in dollars. This is the target variable to predict.| MSSubClass | The building class|MSZoning | The general zoning classification|LotFrontage | Linear feet of street connected to property|
|LotArea | Lot size in square feet|Street | Type of road access|Alley | Type of alley access|LotShape | General shape of property|
|LandContour | Flatness of the property|Utilities | Type of utilities available|LotConfig | Lot configuration|LandSlope | Slope of property|
|Neighborhood | Physical locations within Ames city limits|Condition1 | Proximity to main road or railroad|Condition2 | Proximity to main road or railroad (if a second is present)|BldgType | Type of dwelling|
|HouseStyle | Style of dwelling|OverallQual | Overall material and finish quality|OverallCond | Overall condition rating|YearBuilt | Original construction date|
|YearRemodAdd | Remodel date|RoofStyle | Type of roof|RoofMatl | Roof material|Exterior1st | Exterior covering on house|
|Exterior2nd | Exterior covering on house (if more than one material)|MasVnrType | Masonry veneer type|MasVnrArea | Masonry veneer area in square feet|ExterQual | Exterior material quality|
|ExterCond | Present condition of the material on the exterior|Foundation | Type of foundation|BsmtQual | Height of the basement|BsmtCond | General condition of the basement|
|BsmtExposure | Walkout or garden level basement walls|BsmtFinType1 | Quality of basement finished area|BsmtFinSF1 | Type 1 finished square feet|BsmtFinType2 | Quality of second finished area (if present)|
|BsmtFinSF2 | Type 2 finished square feet|BsmtUnfSF | Unfinished square feet of basement area|TotalBsmtSF | Total square feet of basement area|Heating | Type of heating|
|HeatingQC | Heating quality and condition|CentralAir | Central air conditioning|Electrical | Electrical system|1stFlrSF | First Floor square feet|
|2ndFlrSF | Second floor square feet|LowQualFinSF | Low quality finished square feet (all floors)|GrLivArea | Above grade (ground) living area square feet|BsmtFullBath | Basement full bathrooms|
|BsmtHalfBath | Basement half bathrooms|FullBath | Full bathrooms above grade|HalfBath | Half baths above grade|Bedroom | Number of bedrooms above basement level|
|Kitchen | Number of kitchens|KitchenQual | Kitchen quality|TotRmsAbvGrd | Total rooms above grade (does not include bathrooms)|Functional | Home functionality rating|
|Fireplaces | Number of fireplaces|FireplaceQu | Fireplace quality|GarageType | Garage location|GarageYrBlt | Year garage was built|
|GarageFinish | Interior finish of the garage|GarageCars | Size of garage in car capacity|GarageArea | Size of garage in square feet|GarageQual | Garage quality|
|GarageCond | Garage condition|PavedDrive | Paved driveway|WoodDeckSF | Wood deck area in square feet|OpenPorchSF | Open porch area in square feet|
|EnclosedPorch | Enclosed porch area in square feet|3SsnPorch | Three season porch area in square feet|ScreenPorch | Screen porch area in square feet|PoolArea | Pool area in square feet|
|PoolQC | Pool quality|Fence | Fence quality|MiscFeature | Miscellaneous feature not covered in other categories|MiscVal | Dollar value of miscellaneous feature|
|MoSold | Month Sold|YrSold | Year Sold|SaleType | Type of sale|SaleCondition | Condition of sale|



In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split

SEED=42

In [None]:
df = pd.read_csv('/data/01_raw/house-pricing.csv')
# train_test_split returns both feature splits first, then both target splits
X_train, X_test, y_train, y_test = train_test_split(
  df.drop(columns=['SalePrice', 'id']), df['SalePrice'], test_size=0.3, random_state=SEED
)
X_train.head()
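
As a quick sanity check on the split, the sketch below uses a small made-up frame (the `toy` data and the short variable names are assumptions, not part of the challenge data) to show the order in which `train_test_split` returns its pieces: the two feature splits come first, then the two target splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, used only to illustrate the return order.
toy = pd.DataFrame({"LotArea": range(10), "SalePrice": range(100, 110)})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns=["SalePrice"]), toy["SalePrice"],
    test_size=0.3, random_state=42,
)

# Features and targets of the same split share the same row index.
print(X_tr.shape, X_te.shape)
print(X_tr.index.equals(y_tr.index), X_te.index.equals(y_te.index))
```

If the names on the left-hand side are unpacked in a different order, the code still runs, but the variables end up holding the wrong pieces, so checking shapes and indexes right after the split is a cheap safeguard.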

In [None]:
# isna() and isnull() are aliases in pandas, so a single check is enough
for name in X_train.columns:
  null_count = X_train[name].isna().sum()
  if null_count > 0:
    print(f"train | {name} | Null count: {null_count}")
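
The same information can also be collected into a single table rather than printed column by column. The following is a minimal sketch on a made-up frame (the `toy` data and the `null_count` / `null_share` labels are assumptions for illustration); on the real data, `toy` would be replaced by `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for X_train.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [None, None, "Pave", None],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "null_count": missing,
    "null_share": missing / len(toy),
})

# Keep only columns that actually have missing values, worst first.
summary = summary[summary["null_count"] > 0].sort_values(
    "null_count", ascending=False
)
print(summary)
```

Sorting by the missing share makes it easier to see which columns are candidates for dropping and which for imputation.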

In [None]:
# Count fully duplicated rows (every repeat after the first occurrence)
X_train.duplicated(subset=None, keep='first').sum()

In [None]:
def pre_processing(df):
  # lots_of_zeros = ['MasVnrArea', 'BsmtFinSF2', 'LowQualFinSF', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF', '2ndFlrSF', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']
  list_miss_to_mean = ['LotFrontage']
  remove_cols = []
  list_remove_cols_missing = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
  list_remove_cols_zeros = ['PoolArea','MiscVal']
  list_remove_rows_garage = ['GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageQual', 'GarageCond','GarageCars', 'GarageArea']
  list_remove_rows_bsmt = ['BsmtExposure', 'BsmtQual', 'BsmtCond', 'BsmtFinType1','BsmtFinType2']
  list_remove_rows_little = ['MasVnrType', 'MasVnrArea', 'Electrical']

  # Drop mostly-missing and mostly-zero columns; errors='ignore' makes a repeated call a no-op
  remove_cols = list_remove_cols_missing + list_remove_cols_zeros
  df = df.drop(columns=remove_cols, errors='ignore')

  # df = df.dropna(subset=list_remove_rows_garage).dropna(subset=list_remove_rows_bsmt).dropna(subset=list_remove_rows_little)

  # Fill missing values with the mean of the same column
  for col in list_miss_to_mean:
    df[col] = df[col].fillna(df[col].mean())

  return df

In [None]:
# df = pre_processing(df)
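
Note that `pre_processing` computes the mean from whichever frame it receives, so calling it separately on the train and test splits would fill each with its own mean. One possible variant, sketched below under that assumption, computes the fill values on the training split only and reuses them for the test split. The helper names `fit_fill_values` and `apply_fill_values` and the toy frames are hypothetical, not part of the notebook.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train, cols):
    """Compute per-column means on the training frame only."""
    return {col: train[col].mean() for col in cols}

def apply_fill_values(df, fill_values):
    """Return a copy of df with missing values replaced by the stored means."""
    return df.fillna(value=fill_values)

# Hypothetical toy splits standing in for X_train and X_test.
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 100.0]})

fill_values = fit_fill_values(toy_train, ["LotFrontage"])
train_filled = apply_fill_values(toy_train, fill_values)
test_filled = apply_fill_values(toy_test, fill_values)
print(test_filled)
```

Keeping the statistics in a plain dictionary means the same values can be applied to any later frame, and the test split never influences how the training data is filled.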