## House Prices - Advanced Regression Techniques

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home. In this notebook I implement an ML regression model to predict each house's sale price.

[You can learn more about the data here](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data)

In [1]:
# install the required packages
#!pip install -r requirements.txt

In [2]:
# import libs 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# download house-prices-advanced-regression-techniques data using kaggle API
#!kaggle competitions download -c house-prices-advanced-regression-techniques --force

In [4]:
# unzip house-prices-advanced-regression-techniques.zip to get the data files
#!unzip house-prices-advanced-regression-techniques.zip

In [5]:
# ls the data files 
!ls

House Prices - Advanced Regression Techniques.ipynb
data_description.txt
docker-compose
house-prices-advanced-regression-techniques.zip
requirements.txt
sample_submission.csv
test.csv
train.csv
~


In [6]:
# import data to df
df = pd.read_csv("train.csv")

### Exploratory Data Analysis

In [7]:
# see the data
df.head()

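| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |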


In [8]:
# get info about the df rows and columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [9]:
# explore the data description file to understand the data 
!cat data_description.txt

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

In [19]:
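# percentage of missing values in each column
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
# show the percentages as a plain list, in column order
percent_missing.tolist()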

[0.0,
 0.0,
 0.0,
 17.73972602739726,
 0.0,
 0.0,
 93.76712328767124,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.547945205479452,
 0.547945205479452,
 0.0,
 0.0,
 0.0,
 2.5342465753424657,
 2.5342465753424657,
 2.6027397260273974,
 2.5342465753424657,
 0.0,
 2.6027397260273974,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0684931506849315,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 47.26027397260274,
 5.5479452054794525,
 5.5479452054794525,
 5.5479452054794525,
 0.0,
 0.0,
 5.5479452054794525,
 5.5479452054794525,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 99.52054794520548,
 80.75342465753425,
 96.3013698630137,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]
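
The raw list above is hard to read without column names. The following is a minimal sketch of one way to get a labelled summary that keeps only the columns with missing values, sorted from most to least missing. The `missing_summary` helper and the toy `demo` frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(dataframe):
    """Return the percentage of missing values per column, highest first."""
    # isnull().mean() gives the fraction of nulls per column; scale to percent
    percent_missing = dataframe.isnull().mean() * 100
    # keep only the columns that have at least one missing value
    percent_missing = percent_missing[percent_missing > 0]
    return percent_missing.sort_values(ascending=False)

# toy data standing in for train.csv (hypothetical values)
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
})
summary = missing_summary(demo)
print(summary)
```

On the real data, the same call would be `missing_summary(df)`.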

In [28]:
# print each column with the fraction of its values that are null
def print_null_percentage(dataframe):
    # fraction of null values per column (0.0 to 1.0);
    # divide by the length of the frame passed in, not the global df
    null_fractions = dataframe.isnull().sum().values / len(dataframe)

    # print the column index, column name, and null fraction
    for i, column in enumerate(dataframe.columns):
        print("{}.{} : {}".format(i, column, null_fractions[i]))

print_null_percentage(df)

0.Id : 0.0
1.MSSubClass : 0.0
2.MSZoning : 0.0
3.LotFrontage : 0.1773972602739726
4.LotArea : 0.0
5.Street : 0.0
6.Alley : 0.9376712328767123
7.LotShape : 0.0
8.LandContour : 0.0
9.Utilities : 0.0
10.LotConfig : 0.0
11.LandSlope : 0.0
12.Neighborhood : 0.0
13.Condition1 : 0.0
14.Condition2 : 0.0
15.BldgType : 0.0
16.HouseStyle : 0.0
17.OverallQual : 0.0
18.OverallCond : 0.0
19.YearBuilt : 0.0
20.YearRemodAdd : 0.0
21.RoofStyle : 0.0
22.RoofMatl : 0.0
23.Exterior1st : 0.0
24.Exterior2nd : 0.0
25.MasVnrType : 0.005479452054794521
26.MasVnrArea : 0.005479452054794521
27.ExterQual : 0.0
28.ExterCond : 0.0
29.Foundation : 0.0
30.BsmtQual : 0.025342465753424658
31.BsmtCond : 0.025342465753424658
32.BsmtExposure : 0.026027397260273973
33.BsmtFinType1 : 0.025342465753424658
34.BsmtFinSF1 : 0.0
35.BsmtFinType2 : 0.026027397260273973
36.BsmtFinSF2 : 0.0
37.BsmtUnfSF : 0.0
38.TotalBsmtSF : 0.0
39.Heating : 0.0
40.HeatingQC : 0.0
41.CentralAir : 0.0
42.Electrical : 0.0006849315068493151
43.1stFlrS

- Alley: 94% of the values are null
- PoolQC: 99.5% of the values are null
- Fence: 81% of the values are null
- MiscFeature: 96% of the values are null

__[Alley, PoolQC, Fence, MiscFeature]__ __These columns have a very high percentage of null values. They will be dropped because columns this sparse are very hard to impute or replace.__
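
Instead of listing the sparse columns by hand, they could also be selected with a cutoff. This is a minimal sketch assuming a threshold of 80% nulls. That threshold is an illustrative choice, not one stated in the notebook, although with the percentages shown above it would select the same four columns and leave FireplaceQu (47%) in place. The helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def columns_above_null_threshold(dataframe, threshold=0.8):
    """List the columns whose fraction of null values exceeds `threshold`."""
    null_fraction = dataframe.isnull().mean()
    return null_fraction[null_fraction > threshold].index.tolist()

# toy data standing in for train.csv (hypothetical values)
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "FireplaceQu": [np.nan, "TA", "TA", np.nan, "Gd"],   # 40% null, kept
    "PoolQC": [np.nan, np.nan, np.nan, np.nan, np.nan],  # 100% null, dropped
})
to_drop = columns_above_null_threshold(demo, threshold=0.8)
demo_01 = demo.drop(columns=to_drop)
print(to_drop)
```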

- LotFrontage: 18% of the values are null
- MasVnrType, MasVnrArea: about 0.5% of the values are null; Electrical: under 0.1%
- BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2: about 2.5% of the values are null
- GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond: about 5.5% of the values are null
- FireplaceQu: 47% of the values are null, a relatively high percentage

__The null values in these columns will be cleaned.__
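
The notebook does not say at this point how the remaining nulls are cleaned. As a rough sketch of one common baseline, numeric columns can be filled with the median and non-numeric columns with the most frequent value. The `fill_missing_simple` helper and the toy data are assumptions for illustration, not the author's method.

```python
import numpy as np
import pandas as pd

def fill_missing_simple(dataframe):
    """Fill numeric nulls with the median and other nulls with the mode."""
    filled = dataframe.copy()
    for column in filled.columns:
        # skip columns that have nothing to fill
        if not filled[column].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            # mode() can return several values; take the first one
            filled[column] = filled[column].fillna(filled[column].mode().iloc[0])
    return filled

# toy data standing in for train.csv (hypothetical values)
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Electrical": ["SBrkr", "SBrkr", np.nan, "FuseA"],
})
demo_filled = fill_missing_simple(demo)
print(demo_filled)
```

For the garage, basement, and fireplace columns, data_description.txt uses NA to mean the feature is absent rather than unknown, so filling those with a constant such as "None" may be a better fit than the mode.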

In [29]:
# step 01: drop the mostly-null columns (at indices 6, 72, 73, 74)
df_01 = df.drop(["Alley", "PoolQC" , "Fence" , "MiscFeature"], axis = 1)