### Introduction :
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### Practice Skills
- Creative feature engineering
- Advanced regression techniques like random forest and gradient boosting

### Acknowledgments
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. 

### Project Goals :
The objective of this challenge is to use our knowledge of machine learning and data science to predict house prices.

### Data :
The files train.csv and test.csv were provided by Kaggle.
In this project, we will use train.csv to train the model and test.csv to test it.
Each column in the dataset represents a characteristic and each row is a sample.

### Analysis : 
In this project we will use descriptive statistics and data visualization to summarize the data. Since the task is to predict a continuous value, we will use regression algorithms from supervised learning.
We will use cross-validation to check how well our models generalize, and grid search to tune the hyperparameters of the best model.
Since our variables are numerous, we will use principal component analysis (PCA) to visualize the data.
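
The cross-validation and grid search steps mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final pipeline: the data come from `make_regression` rather than the Ames files, and `Ridge` with its `alpha` grid is only a stand-in model chosen for speed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the housing features (assumption: not the Ames data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# 5-fold cross-validation: one R^2 score per held-out fold.
cv_scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print("R^2 per fold:", np.round(cv_scores, 3))

# Grid search: cross-validate every candidate alpha and keep the best one.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="r2")
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best mean CV R^2:", round(search.best_score_, 3))
```

The spread of the per-fold scores gives a rough idea of how stable the model is across different subsets of the data.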

### Evaluation :
We will evaluate the model on the test data, using the R² score as the metric.
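
As a reminder of what the metric measures, R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up prices and predictions (not values from the dataset) to check the formula against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up sale prices and predictions, for illustration only.
y_true = np.array([200000.0, 150000.0, 320000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0, 190000.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean price.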

### Import of modules

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

### Collecting the data 

In [20]:
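# Load the training data and set the Id column aside
train_data = pd.read_csv("train.csv")
id_train_data = train_data["Id"]
train_data = train_data.drop("Id", axis=1)
# Target variable (SalePrice is still a column of train_data at this point)
y = train_data["SalePrice"]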

In [9]:
#Examine the structure of the data

In [10]:
train_data.head()

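| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |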


In [11]:
#Data description

In [12]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

#### A more complete description of the variables is available in the file data_description.txt.

In [13]:
categorical_features = train_data.select_dtypes(include="object").columns
numeric_features = train_data.select_dtypes(exclude="object").columns
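
To make the effect of this dtype-based split concrete, here is a small sketch on a toy frame. The values are hypothetical and only the column names are borrowed from the dataset; `dtype=object` is set explicitly to mirror how the text columns appear in the `info()` output above. Note that an integer-coded column such as MSSubClass ends up on the numeric side.

```python
import pandas as pd

# Toy frame with hypothetical values; column names borrowed from the dataset.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 70],
    "LotFrontage": [65.0, None, 60.0],
    "MSZoning": pd.Series(["RL", "RL", "RM"], dtype=object),
    "Alley": pd.Series([None, "Grvl", None], dtype=object),
})

# Same split as in the cell above: object columns vs. everything else.
categorical = toy.select_dtypes(include="object").columns.tolist()
numeric = toy.select_dtypes(exclude="object").columns.tolist()

# Number of missing values per column, keeping only columns that have any.
missing = toy.isna().sum()

print(categorical)
print(numeric)
print(missing[missing > 0])
```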

In [14]:
#Statistical summary of numerical variables.
train_data[numeric_features].describe()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | 0.0 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | 0.0 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | 1474.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


In [17]:
#Summary of categorical variables
for col in categorical_features:
    print(f"Variable  {col} : ")
    print(train_data[col].value_counts())
    print()
    print()


Variable  MSZoning : 
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64


Variable  Street : 
Pave    1454
Grvl       6
Name: Street, dtype: int64


Variable  Alley : 
Grvl    50
Pave    41
Name: Alley, dtype: int64


Variable  LotShape : 
Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64


Variable  LandContour : 
Lvl    1311
Bnk      63
HLS      50
Low      36
Name: LandContour, dtype: int64


Variable  Utilities : 
AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64


Variable  LotConfig : 
Inside     1052
Corner      263
CulDSac      94
FR2          47
FR3           4
Name: LotConfig, dtype: int64


Variable  LandSlope : 
Gtl    1382
Mod      65
Sev      13
Name: LandSlope, dtype: int64


Variable  Neighborhood : 
NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mi

### Creation of the test set

In [19]:
test_data = pd.read_csv("test.csv")
id_test_data = test_data["Id"]
test_data = test_data.drop("Id",axis = 1)
test_data

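| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1454 | 160 | RM | 21.0 | 1936 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2006 | WD | Normal |
| 1455 | 160 | RM | 21.0 | 1894 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 4 | 2006 | WD | Abnorml |
| 1456 | 20 | RL | 160.0 | 20000 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2006 | WD | Abnorml |
| 1457 | 85 | RL | 62.0 | 10441 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | Shed | 700 | 7 | 2006 | WD | Normal |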


### Data exploration

#### Conventional plots do not scale to a data set with this many dimensions, so we use PCA to project the data onto a few components for visualization.
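
The idea can be illustrated on synthetic data before applying it to the housing features. In this sketch (an assumption-laden toy, not the notebook's data) ten correlated features are generated from two hidden factors, and PCA recovers a two-dimensional view that keeps most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features driven by 2 hidden factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# Standardize first so that no feature dominates because of its scale.
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)
print(pca.explained_variance_ratio_)
```

The explained variance ratio indicates how much of the original spread each component retains, which is a useful sanity check before reading anything into a scatter plot of the components.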

In [21]:
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector as selector

In [22]:
categorical_features = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition']

numeric_features =['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd',
       'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
       'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea',
       'MiscVal', 'MoSold', 'YrSold']

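# Impute missing numeric values with the median, then standardize
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Fill missing categorical values with 'missing', then one-hot encode
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Caution: these selectors pick columns by dtype, not from the lists above.
# 'float64' matches only LotFrontage, MasVnrArea and GarageYrBlt, and no column
# read by pd.read_csv has the 'category' dtype, so only three features pass through.
# Passing numeric_features and categorical_features instead would use every column.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_transformer, selector(dtype_include='float64')),
        ('cat', cat_transformer, selector(dtype_include='category'))])

# Hold out 20% of the training rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(train_data, y, test_size=0.2,
                                                    random_state=0)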
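
For comparison, a `ColumnTransformer` can also be given explicit lists of column names rather than dtype selectors. The sketch below is a hedged illustration on a toy frame: the values are hypothetical, only the column names come from the dataset, and the names `toy`, `num_cols`, `cat_cols` and `toy_preprocessor` are introduced here for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical values; dtype=object mirrors text columns read from CSV.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Street": pd.Series(["Pave", "Pave", "Grvl", "Pave"], dtype=object),
    "Alley": pd.Series([np.nan, "Grvl", np.nan, "Pave"], dtype=object),
})
num_cols = ["LotArea", "LotFrontage"]
cat_cols = ["Street", "Alley"]

# Same two pipelines as in the cell above: impute then scale, impute then encode.
num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                     ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Columns are routed by name, so integer and float columns are both included.
toy_preprocessor = ColumnTransformer([("num", num_pipe, num_cols),
                                      ("cat", cat_pipe, cat_cols)])
encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)
```

Checking the shape of the transformed array against the expected number of numeric columns plus one-hot categories is a quick way to confirm which columns the transformer actually used.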

In [23]:
# Fit the preprocessor on the training split, then project onto 3 principal components
housing_scared = preprocessor.fit_transform(X_train)
housing_pca = PCA(n_components=3)
components = housing_pca.fit_transform(housing_scared)

# Share of the variance captured by each component
var_ratio = housing_pca.explained_variance_ratio_
print(var_ratio)

housing_pcomp = pd.DataFrame(components, columns=['Comp1', 'Comp2', 'Comp3'])
print(housing_pcomp.head())

[0.44006144 0.3148389  0.24509966]
      Comp1     Comp2     Comp3
0  2.427752  0.046970  0.399610
1 -0.979238  0.003048  0.124365
2 -1.576714  1.681734  0.908329
3  0.745894 -0.541808 -0.420741
4  1.756796  1.502401 -1.035206


In [33]:
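# Attach the target for plotting.
# Caution: y[:len(housing_pcomp)] takes the first rows of y in their original order,
# while the components follow the shuffled rows of X_train, so the two are not
# row-aligned; y_train.to_numpy() would pair each component row with its own price.
housing_pcomp["SalePrice"] = y[:len(housing_pcomp)]
housing_pcomp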

| | Comp1 | Comp2 | Comp3 | SalePrice |
|---|---|---|---|---|
| 0 | 2.427752 | 0.046970 | 0.399610 | 208500 |
| 1 | -0.979238 | 0.003048 | 0.124365 | 181500 |
| 2 | -1.576714 | 1.681734 | 0.908329 | 223500 |
| 3 | 0.745894 | -0.541808 | -0.420741 | 140000 |
| 4 | 1.756796 | 1.502401 | -1.035206 | 250000 |
| ... | ... | ... | ... | ... |
| 1163 | 2.873462 | -0.244450 | 1.601259 | 108959 |
| 1164 | -0.135692 | -0.703762 | -0.745119 | 194000 |
| 1165 | -0.414379 | 0.000974 | -0.390617 | 233170 |
| 1166 | 0.293005 | -0.487781 | -0.970579 | 245350 |
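
When a target column is attached to an array that came out of a shuffled split, the rows have to be matched by position. The following sketch uses made-up data in which the target is exactly ten times the feature, so it is easy to see whether each row is paired with its own target; all names here are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: the target is exactly 10x the feature, so alignment is easy to check.
X = pd.DataFrame({"feature": np.arange(10)})
y = pd.Series(np.arange(10) * 10, name="target")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Transformed arrays lose the original index, so the new frame gets a fresh 0..n-1 index.
transformed = pd.DataFrame(X_train.to_numpy(), columns=["feature"])

# First n rows of y in their original order vs. the rows that belong to X_train.
sliced = transformed.assign(target=y[:len(transformed)])
aligned = transformed.assign(target=y_train.to_numpy())

matches_sliced = bool((sliced["target"] == 10 * sliced["feature"]).all())
matches_aligned = bool((aligned["target"] == 10 * aligned["feature"]).all())
print(matches_sliced, matches_aligned)
```

Only the second variant keeps every feature row next to its own target, which matters for any correlation or plot computed afterwards.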


In [None]:
#Data visualization
sns.pairplot(housing_pcomp)

#### Pearson correlation coefficients, to see how each component relates to the target variable.

In [36]:
corr = housing_pcomp.corr()
corr["SalePrice"].sort_values(ascending = False)

SalePrice    1.000000
Comp2        0.063848
Comp3       -0.021595
Comp1       -0.026943
Name: SalePrice, dtype: float64
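
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below, on made-up numbers rather than the notebook's components, computes it by hand and compares it with `DataFrame.corr()`, which uses the Pearson method by default.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
df = pd.DataFrame({
    "Comp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Price": [110.0, 135.0, 150.0, 180.0, 205.0],
})

x, t = df["Comp"].to_numpy(), df["Price"].to_numpy()

# Pearson r written with centered sums: sum(dx * dt) / sqrt(sum(dx^2) * sum(dt^2)).
dx, dt = x - x.mean(), t - t.mean()
r_manual = np.sum(dx * dt) / np.sqrt(np.sum(dx ** 2) * np.sum(dt ** 2))

r_pandas = df.corr().loc["Comp", "Price"]
print(r_manual, r_pandas)
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate that the variable carries little linear information about the target.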

### Model selection and training

#### In the exploration phase, we cleaned and prepared the data. We now move on to choosing supervised learning algorithms for the regression.

In [43]:
# housing_scared and y_train are already prepared above;
# apply the fitted preprocessor to the held-out split
X_test = preprocessor.transform(X_test)
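
One possible way to choose among candidate regressors is to compare their mean cross-validated R² scores. The sketch below is only an illustration under stated assumptions: the data are synthetic (`make_regression`), and the three candidates and their settings are picked for the example, not taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training data (assumption: not the Ames data).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold R^2 for each candidate; the highest mean suggests which model to tune further.
mean_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
for name, score in sorted(mean_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

On the real data, the same loop could be run with `housing_scared` and `y_train` in place of the synthetic arrays.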