## What is a Decision tree?
#### Decision tree learning uses a decision tree as a predictive model which maps observations about an item, represented in the branches, to conclusions about the item's target value, represented in the leaves.
<br>

For example, in the HOME dataset used here, the target is the sale price of each home, predicted from a set of given features.
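To make the branch-and-leaf idea concrete, here is a minimal hand-written sketch of a tiny tree. The function name, the split thresholds and the leaf prices are all invented for illustration; a real decision tree learns its splits and leaf values from the training data.

```python
def predict_price(year_built, bedrooms):
    """Toy decision tree: each `if` is a branch, each `return` is a leaf.

    All thresholds and prices are hypothetical, chosen only to show the
    structure. They are not learned from the HOME dataset.
    """
    if year_built >= 1990:
        if bedrooms >= 3:
            return 250_000  # leaf: newer home, 3+ bedrooms
        return 200_000      # leaf: newer home, fewer bedrooms
    if bedrooms >= 3:
        return 150_000      # leaf: older home, 3+ bedrooms
    return 120_000          # leaf: older home, fewer bedrooms


print(predict_price(2003, 3))
print(predict_price(1915, 2))
```

Each observation follows exactly one path from the top of the function to a single `return`, which is how a tree maps features to one predicted value.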

##### The first step is to load the dataset.

In [1]:
import pandas as pd

In [17]:
file_path = 'C:/Users/hp/Desktop/MACHINE LEARNING/HOME-DATASET/train.csv'

In [18]:
home_data = pd.read_csv(file_path)

In [19]:
home_data.head(15)

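| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |
| 5 | 6 | 50 | RL | 85.0 | 14115 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
| 6 | 7 | 20 | RL | 75.0 | 10084 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 8 | 2007 | WD | Normal | 307000 |
| 7 | 8 | 60 | RL | | 10382 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
| 8 | 9 | 50 | RM | 51.0 | 6120 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2008 | WD | Abnorml | 129900 |
| 9 | 10 | 190 | RL | 50.0 | 7420 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 1 | 2008 | WD | Normal | 118000 |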


##### Now the dataset has been loaded.
##### Next, select the data for modeling: the target and the features.

In [20]:
home_data.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [21]:
y = home_data.SalePrice

In [22]:
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

In [23]:
X.describe()

| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


In [24]:
X.head(5)

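| | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |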


##### Next, build the model. The scikit-learn library (imported as sklearn) is used to build models. The steps are:
 - Define -> Choose the type of model to use and set its parameters
 - Fit -> Fit the model to the provided data so that it learns patterns from it
 - Predict -> Use the fitted model to predict target values
 - Evaluate -> Measure how accurate the model's predictions are

In [11]:
from sklearn.tree import DecisionTreeRegressor

In [25]:
home_model = DecisionTreeRegressor(random_state=1)
home_model.fit(X,y)

DecisionTreeRegressor(random_state=1)

In [26]:
predictions = home_model.predict(X)
print(predictions)

[208500. 181500. 223500. ... 266500. 142125. 147500.]


In [27]:
home_model.predict(X.head(5))

array([208500., 181500., 223500., 140000., 250000.])

### Now the model has been built. The next step is to check its quality. Model validation measures how well the model predicts, which is the starting point for improving it.
#### There are many ways to validate a model. One effective way is the Mean Absolute Error (MAE): for each home, error = actual - predicted; take the absolute value of each error, then average the results.
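As a small worked example of the MAE calculation, the sketch below computes it by hand in plain Python. The three actual prices are the first three `SalePrice` values shown earlier; the predicted values and the helper name `mean_absolute_error_manual` are made up for illustration.

```python
actual = [208500, 181500, 223500]     # first three SalePrice values from the data
predicted = [200000, 190000, 223500]  # hypothetical predictions, for illustration only


def mean_absolute_error_manual(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


mae = mean_absolute_error_manual(actual, predicted)
print(mae)
```

Here the absolute errors are 8500, 8500 and 0. Taking absolute values keeps an over-prediction and an under-prediction from cancelling each other out.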


In [28]:
from sklearn.metrics import mean_absolute_error

In [31]:
prediction_price = home_model.predict(X)
mean_absolute_error(y,prediction_price)

62.35433789954339

#### So far, we have used the same data for training and for testing the model. This makes the error look far smaller than it really is, because the model has already seen these exact homes during training. So the data is split into a training dataset and a testing dataset: the model is fitted on the first, and its accuracy is checked on the second.
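To see what the split does, here is a minimal sketch on synthetic stand-in arrays with the same number of rows as the dataset (1460). The `*_demo` names are assumptions for this illustration; with no `test_size` given, `train_test_split` holds out 25% of the rows by default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1460 rows, 2 feature columns, one target value per row.
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

train_X_demo, test_X_demo, train_y_demo, test_y_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

# Number of rows in each part of the split.
print(len(train_X_demo), len(test_X_demo))

# Every target value here is unique, so an empty intersection means
# no row ended up in both parts.
overlap = set(train_y_demo.tolist()) & set(test_y_demo.tolist())
print(len(overlap))
```

Because the two parts share no rows, an error measured on the test part reflects how the model behaves on homes it has not seen.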

In [34]:
from sklearn.model_selection import train_test_split
y = home_data.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

In [35]:
train_X, test_X, train_y, test_y = train_test_split(X,y,random_state=0)

In [36]:
new_home_model = DecisionTreeRegressor()
new_home_model.fit(train_X,train_y)

DecisionTreeRegressor()

In [37]:
new_prediction = new_home_model.predict(test_X)
print(mean_absolute_error(test_y,new_prediction))

33026.55616438356


So, this is the error we can actually expect when the model is given unseen values. <br> The in-sample MAE was about 62 dollars, while the out-of-sample MAE is about 33,000 dollars. This is the difference between a model that is almost exactly right and one that is unusable for most practical purposes. As a point of reference, the sale prices in the rows shown earlier range from 118,000 to 307,000 dollars, so an error of this size is a substantial fraction of a typical home's price. <br>

There are many ways to improve this model, such as experimenting to find better features or different model types.

Now we know a way to measure model accuracy (at least one way :))

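### Now getting into the concept of over-fitting and under-fitting.
#### A tree with too few leaves cannot capture the patterns in the data (under-fitting), while a tree with too many leaves memorizes the training data and predicts poorly on new data (over-fitting). So we look for the number of leaf nodes that gives the lowest MAE on the test data.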
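The effect is easiest to see by comparing training error with validation error as the tree grows. The sketch below uses synthetic data (a noisy sine curve), not the HOME dataset; the variable names and the leaf counts 2, 20 and 300 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature, target is a sine curve plus random noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for leaves in [2, 20, 300]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    model.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, model.predict(tr_X))
    val_mae = mean_absolute_error(va_y, model.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(leaves, round(train_mae, 3), round(val_mae, 3))
```

With 300 training rows, a limit of 300 leaves lets the tree give every training row its own leaf, so the training MAE drops to zero while the validation MAE stays well above it. That gap is the sign of over-fitting. The 2-leaf tree has a high error on both sets, which is under-fitting.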

In [38]:
y = home_data.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

In [39]:
train_X, test_X, train_y, test_y = train_test_split(X,y,random_state=0)

In [40]:
def get_mae(leaf_nodes, train_X, test_X, train_y, test_y):
    """Fit a tree limited to `leaf_nodes` leaves and return its MAE on the test data."""
    model = DecisionTreeRegressor(max_leaf_nodes=leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    prediction = model.predict(test_X)
    return mean_absolute_error(test_y, prediction)

In [41]:
for leaf_nodes in [10,100,500,1000,5000,10000]:
    values = get_mae(leaf_nodes, train_X, test_X, train_y, test_y)
    print('FOR LEAF NODES %d,   MAE IS %d'%(leaf_nodes,values))

FOR LEAF NODES 10,   MAE IS 32577
FOR LEAF NODES 100,   MAE IS 33648
FOR LEAF NODES 500,   MAE IS 32549
FOR LEAF NODES 1000,   MAE IS 33966
FOR LEAF NODES 5000,   MAE IS 32089
FOR LEAF NODES 10000,   MAE IS 32953


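So, from the above we can see that the MAE reaches its minimum value of 32089 at 5000 leaf nodes, so this is the best of the candidate values tried for this model. The differences between the candidates are small, though, so the choice is not clear-cut.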
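Rather than reading the minimum off the printed lines by eye, the selection can be done in code. This is a minimal sketch that copies the MAE values from the output above into a dictionary; the variable names are illustrative assumptions.

```python
# MAE values copied from the printed output above, keyed by max_leaf_nodes.
mae_by_leaf_nodes = {
    10: 32577,
    100: 33648,
    500: 32549,
    1000: 33966,
    5000: 32089,
    10000: 32953,
}

# Pick the key (leaf count) whose MAE is smallest.
best_leaf_nodes = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)

# Gap between the worst and the best candidate, to judge how much the choice matters.
spread = max(mae_by_leaf_nodes.values()) - min(mae_by_leaf_nodes.values())

print(best_leaf_nodes, spread)
```

In a notebook, the same dictionary could be filled directly from `get_mae` inside the loop instead of being typed in by hand.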