# House Prices: Advanced Regression Techniques
[link]: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data "Link"
[kaggle_link_house_prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)  

train.csv: 1460 rows x 81 columns  
test.csv: 1459 rows x 80 columns
```
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
Neighborhood: Physical locations within Ames city limits
Condition1: Proximity to main road or railroad
Condition2: Proximity to main road or railroad (if a second is present)
BldgType: Type of dwelling
HouseStyle: Style of dwelling
OverallQual: Overall material and finish quality
OverallCond: Overall condition rating
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
RoofMatl: Roof material
Exterior1st: Exterior covering on house
Exterior2nd: Exterior covering on house (if more than one material)
MasVnrType: Masonry veneer type
MasVnrArea: Masonry veneer area in square feet
ExterQual: Exterior material quality
ExterCond: Present condition of the material on the exterior
Foundation: Type of foundation
BsmtQual: Height of the basement
BsmtCond: General condition of the basement
BsmtExposure: Walkout or garden level basement walls
BsmtFinType1: Quality of basement finished area
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Quality of second finished area (if present)
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
HeatingQC: Heating quality and condition
CentralAir: Central air conditioning
Electrical: Electrical system
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Number of bedrooms above basement level
Kitchen: Number of kitchens
KitchenQual: Kitchen quality
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality rating
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
GarageType: Garage location
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
GarageCond: Garage condition
PavedDrive: Paved driveway
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Fence: Fence quality
MiscFeature: Miscellaneous feature not covered in other categories
MiscVal: Value of miscellaneous feature
MoSold: Month Sold
YrSold: Year Sold
SaleType: Type of sale
SaleCondition: Condition of sale
```




In [212]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import sklearn.metrics

data = pd.read_csv('train.csv')
# Features: every column except the first (Id) and the last (SalePrice)
X = data.iloc[:, 1:-1]
# Target: sale price as a NumPy array (pop also removes the column from data)
y = data.pop('SalePrice').values
X.head()

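| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |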


### Check for missing values

In [95]:
#X.info()
# Number of missing values per column
x_null_count = X.isnull().sum(axis = 0)
columns_name = X.columns.values
# Names of the columns that contain at least one NaN
columns_has_nan_name = columns_name[x_null_count > 0]
print('Has %d columns Nan'%len(columns_has_nan_name))
# Preview those columns five at a time
print(X[columns_has_nan_name].iloc[:,:5].head())
print(X[columns_has_nan_name].iloc[:,5:10].head())
print(X[columns_has_nan_name].iloc[:,10:15].head())
print(X[columns_has_nan_name].iloc[:,15:].head())

Has 19 columns Nan
   LotFrontage Alley MasVnrType  MasVnrArea BsmtQual
0         65.0   NaN    BrkFace       196.0       Gd
1         80.0   NaN       None         0.0       Gd
2         68.0   NaN    BrkFace       162.0       Gd
3         60.0   NaN       None         0.0       TA
4         84.0   NaN    BrkFace       350.0       Gd
  BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2 Electrical
0       TA           No          GLQ          Unf      SBrkr
1       TA           Gd          ALQ          Unf      SBrkr
2       TA           Mn          GLQ          Unf      SBrkr
3       Gd           No          ALQ          Unf      SBrkr
4       TA           Av          GLQ          Unf      SBrkr
  FireplaceQu GarageType  GarageYrBlt GarageFinish GarageQual
0         NaN     Attchd       2003.0          RFn         TA
1          TA     Attchd       1976.0          RFn         TA
2          TA     Attchd       2001.0          RFn         TA
3          Gd     Detchd       1998.0          Un

In [96]:
x_null_count[x_null_count>0]

LotFrontage      259
Alley           1369
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64
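
Raw counts are easier to compare as fractions of the row count: Alley, for example, is missing in 1369 of 1460 rows, about 94%. The following is a minimal sketch of that calculation on a toy frame; `toy` and its values are made up for illustration and stand in for `X`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for X; the values are illustrative only.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Fraction of missing values per column, largest first.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually have gaps.
missing_ratio = missing_ratio[missing_ratio > 0]
print(missing_ratio)
```

Columns that are almost entirely missing are candidates for a different treatment than columns with a handful of gaps, which is the distinction the following sections work through.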

### Electrical has only one missing value
Drop that row.

In [142]:
# Locate the row with the missing Electrical value
missing_mask = X['Electrical'].isnull()
X.loc[missing_mask, ['Electrical']]
# Drop it by index label (drop() takes labels, not positions)
row_index = X.index[missing_mask]
X_fix_Electrical = X.drop(row_index, axis=0)
# Recount the missing values per column
x_null_count = X_fix_Electrical.isnull().sum(axis = 0)
x_null_count[x_null_count>0]

LotFrontage      259
Alley           1368
MasVnrType         8
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
FireplaceQu      689
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1452
Fence           1178
MiscFeature     1405
dtype: int64
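
Dropping rows leaves gaps in the index, so from this point on a row's position and its index label can differ. The sketch below, on a made-up toy frame, shows why that matters: `np.where` returns positions, while `DataFrame.drop` interprets its argument as labels.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Electrical": ["SBrkr", np.nan, "FuseA", "SBrkr"],
    "MasVnrType": ["None", "Stone", "BrkFace", np.nan],
})

# First drop: the remaining index labels are 0, 2, 3 (label 1 is gone).
step1 = toy.dropna(subset=["Electrical"])

# The missing MasVnrType row sits at position 2 but carries label 3.
mask = step1["MasVnrType"].isnull()
positions = list(np.where(mask)[0])
labels = list(step1.index[mask])

# Passing positions to drop() removes label 2, which is the wrong row.
wrong = step1.drop(positions, axis=0)
# Passing labels removes the row that is actually missing a value.
right = step1.drop(labels, axis=0)

print(wrong["MasVnrType"].isnull().sum(), right["MasVnrType"].isnull().sum())
```

Using `step1.index[mask]`, or `dropna(subset=[...])` directly, avoids the mismatch without having to reset the index.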

### MasVnrType
MasVnrType: Masonry veneer type  
BrkCmn:   Brick Common  
BrkFace:  Brick Face  
CBlock:   Cinder Block  
None:     None  
Stone:    Stone  
### MasVnrArea 
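MasVnrArea: Masonry veneer area in square feet. This is a numeric column; the rating scale below is the one used by the quality columns (such as ExterQual), not a set of MasVnrArea values.  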
Ex:   Excellent  
Gd:   Good  
TA:   Average/Typical  
Fa:   Fair  
Po:   Poor  

In [179]:
# Show the rows where MasVnrType is missing
missing_mask = X_fix_Electrical['MasVnrType'].isnull()
X_fix_Electrical.loc[missing_mask, ['MasVnrType']]
# Drop them by index label; positions no longer match labels after the earlier drop
row_index = X_fix_Electrical.index[missing_mask]
X_fix_MasVnrType = X_fix_Electrical.drop(row_index, axis=0)

x_null_count = X_fix_MasVnrType.isnull().sum(axis = 0)
x_null_count[x_null_count>0]



LotFrontage      257
Alley           1361
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
FireplaceQu      685
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1444
Fence           1170
MiscFeature     1397
dtype: int64

### BsmtQual 
BsmtQual: Evaluates the height of the basement  
Ex:   Excellent (100+ inches)  
Gd:   Good (90-99 inches)  
TA:   Typical (80-89 inches)  
Fa:   Fair (70-79 inches)  
Po:   Poor (<70 inches)  
NA:   No Basement  
### BsmtCond
BsmtCond: Evaluates the general condition of the basement  
Ex   Excellent  
Gd   Good  
TA   Typical - slight dampness allowed  
Fa   Fair - dampness or some cracking or settling  
Po   Poor - Severe cracking, settling, or wetness  
NA   No Basement  
### BsmtExposure
BsmtExposure: Refers to walkout or garden level walls  
Gd:   Good Exposure  
Av:   Average Exposure (split levels or foyers typically score average or above)  
Mn:   Minimum Exposure  
No:   No Exposure  
NA:   No Basement  
### BsmtFinType1
BsmtFinType1: Rating of basement finished area    
GLQ  Good Living Quarters  
ALQ  Average Living Quarters  
BLQ  Below Average Living Quarters     
Rec  Average Rec Room  
LwQ  Low Quality  
Unf  Unfinished  
NA   No Basement  
### BsmtFinSF1  
BsmtFinSF1: Type 1 finished square feet  
### BsmtFinType2  
BsmtFinType2: Rating of basement finished area (if multiple types)  
GLQ  Good Living Quarters  
ALQ  Average Living Quarters  
BLQ  Below Average Living Quarters   
Rec  Average Rec Room  
LwQ  Low Quality  
Unf  Unfinished  
NA   No Basement  
BsmtFinSF2: Type 2 finished square feet  
BsmtUnfSF: Unfinished square feet of basement area  
TotalBsmtSF: Total square feet of basement area  
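
In the basement columns the code NA means "No Basement", and `pd.read_csv` parses a literal `NA` as NaN by default. Assuming the NaN values in these columns come from that code, an alternative to dropping the rows is to give them an explicit category. A minimal sketch on a made-up toy frame follows; the label `"NoBsmt"` is a hypothetical name chosen here, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Categorical basement columns whose NA code means "No Basement".
bsmt_cols = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2"]

# Toy frame; row 1 stands for a house without a basement (illustrative values).
toy = pd.DataFrame({
    "BsmtQual": ["Gd", np.nan, "TA"],
    "BsmtCond": ["TA", np.nan, "Gd"],
    "BsmtExposure": ["No", np.nan, "Mn"],
    "BsmtFinType1": ["GLQ", np.nan, "ALQ"],
    "BsmtFinType2": ["Unf", np.nan, "Unf"],
    "TotalBsmtSF": [856, 0, 920],
})

# Replace NaN with an explicit category instead of dropping the row.
filled = toy.copy()
filled[bsmt_cols] = filled[bsmt_cols].fillna("NoBsmt")
print(filled)
```

This keeps every row, so the information that a house has no basement stays available to the model.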




In [209]:

label = 'BsmtExposure'
# Show the rows where BsmtExposure is missing
missing_mask = X_fix_MasVnrType[label].isnull()
X_fix_MasVnrType.loc[missing_mask, [label]]
# Positions of those rows; these are NOT index labels, because the earlier drops left gaps in the index
row_index = list(np.where(missing_mask)[0])
#print(row_index)
#print('size '+str(len(row_index)))
#a = X_fix_MasVnrType.drop([17, 39, 90, 102, 156, 182, 258, 341, 361, 370, 391, 519, 530, 531, 551, 644, 702, 733, 746, 775, 865, 891, 894, 944, 978, 994, 1005, 1029, 1039, 1042, 1043, 1084, 1173, 1210, 1212, 1226, 1313, 1403])
#a.info()

# drop() expects labels, so take them from the index rather than from row_index
#row_labels = X_fix_MasVnrType.index[missing_mask]
#before = X_fix_MasVnrType.shape[0]
#X_fix_BsmtExposure = X_fix_MasVnrType.drop(row_labels,axis=0)
#print(before- X_fix_BsmtExposure.shape[0])

#x_null_count = X_fix_BsmtExposure.isnull().sum(axis = 0)
#x_null_count[x_null_count>0]

[17, 39, 90, 102, 156, 182, 258, 341, 361, 370, 391, 519, 530, 531, 551, 644, 702, 733, 746, 775, 865, 891, 894, 944, 978, 994, 1005, 1029, 1039, 1042, 1043, 1084, 1173, 1210, 1212, 1226, 1313, 1403]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1413 entries, 0 to 1459
Data columns (total 79 columns):
MSSubClass       1413 non-null int64
MSZoning         1413 non-null object
LotFrontage      1162 non-null float64
LotArea          1413 non-null int64
Street           1413 non-null object
Alley            89 non-null object
LotShape         1413 non-null object
LandContour      1413 non-null object
Utilities        1413 non-null object
LotConfig        1413 non-null object
LandSlope        1413 non-null object
Neighborhood     1413 non-null object
Condition1       1413 non-null object
Condition2       1413 non-null object
BldgType         1413 non-null object
HouseStyle       1413 non-null object
OverallQual      1413 non-null int64
OverallCond      1413 non-null int64
YearBuilt     

### LotFrontage: Linear feet of street connected to property

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [41]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
model = LinearRegression()
# Not fitted yet: X still contains NaN and string-valued columns
#model.fit(X_train,y_train)
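
`LinearRegression` cannot be fitted on `X` while it still holds NaN and string columns. One possible way to get a first fit, sketched here on a made-up toy frame, is to impute and encode inside a scikit-learn `Pipeline`. The imputation strategies and the `"missing"` fill value are assumptions chosen for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for X and y; the values are illustrative only.
toy_X = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0, 84.0, 85.0],
    "HouseStyle": ["2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fin"],
})
toy_y = np.array([208500, 181500, 223500, 140000, 250000, 143000])

# Split the columns by dtype.
numeric_cols = toy_X.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_X.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    # Numeric columns: fill NaN with the column median.
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Categorical columns: fill NaN with a placeholder, then one-hot encode.
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(toy_X, toy_y)
pred = pipe.predict(toy_X)
print(pred.shape)
```

Keeping the preprocessing inside the pipeline means the same imputation and encoding learned on the training split are applied to the test split.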



### HouseStyle 
List the categories.

In [247]:
from sklearn.preprocessing import OneHotEncoder
# Dense output; scikit-learn 1.2 renamed this argument to sparse_output
one = OneHotEncoder(sparse=False)
#X['HouseStyle'].value_counts()
hs = X[['HouseStyle']].copy()
hs_transform = one.fit_transform(hs)

hs_transform.shape
# Check that decoding the one-hot matrix recovers the original values
np.array_equal(hs.values,one.inverse_transform(hs_transform))

True

In [242]:
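# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0 and removed in 1.2
feature_name = one.get_feature_names()
feature_name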

array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin', 'x0_2.5Unf',
       'x0_2Story', 'x0_SFoyer', 'x0_SLvl'], dtype=object)

In [244]:
one.inverse_transform([hs_transform[0]])

array([['2Story']], dtype=object)

### OneHotEncoder can be set to handle_unknown='ignore'
A category that was not seen during fit is then encoded as [0,0,...,0]  
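
A minimal sketch of that behaviour, assuming a small hand-made set of house styles; `"3Story"` is a made-up category used only to trigger the unknown case.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Categories seen during fit (sorted internally: 1Story, 2Story, SLvl).
train_styles = np.array([["1Story"], ["2Story"], ["SLvl"]], dtype=object)
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_styles)

# "3Story" was not present during fit.
new_styles = np.array([["2Story"], ["3Story"]], dtype=object)

# The default output is sparse, so convert it to a dense array for display.
encoded = enc.transform(new_styles).toarray()
print(encoded)
```

The known category gets a single 1 in its column, while the unseen one becomes an all-zero row instead of raising an error.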
