# Regression Trees
### Economics 588
##### Jacob Van Leeuwen, John Bonney, Erik Webb, Taylor Landon, Rachel Bagnall, Scott Elliott, Jaimie Choi, Isaac Riley

## Illustration - Medical Diagnosis

Imagine a physician evaluating a potential medical diagnosis for a patient. How can she take advantage of machine learning in an intuitive way without relying on a black-box algorithm? Regression trees can act as an effective decision-making mechanism that provides adequate classification accuracy and a simple representation of gathered knowledge.

Suppose the physician is considering a type 2 diabetes diagnosis and that there are two key medical tests: a glycated hemoglobin test and a fasting blood sugar test. The physician has access to the dataset generated below, which contains a diabetes indicator variable and the patients' corresponding test result values.

In [5]:
# References
# https://www.mayoclinic.org/diseases-conditions/type-2-diabetes/diagnosis-treatment/drc-20351199
# https://pythonprogramminglanguage.com/decision-tree-visual-example/

import graphviz 
from sklearn import tree
import random
import decimal
import pandas as pd

glycated_hemoglobin_test_normal = []
glycated_hemoglobin_test_diabetes = []
fasting_blood_sugar_test_normal = []
fasting_blood_sugar_test_diabetes = []

for i in range(0, 500):
    # Glycated Hemoglobin Test
    glycated_hemoglobin_test_normal.append(float(decimal.Decimal(random.randrange(0, 640))/100))
    glycated_hemoglobin_test_diabetes.append(float(decimal.Decimal(random.randrange(570, 800))/100))
    # Fasting Blood Sugar Test
    fasting_blood_sugar_test_normal.append(float(decimal.Decimal(random.randrange(500, 1050))/100))
    fasting_blood_sugar_test_diabetes.append(float(decimal.Decimal(random.randrange(1000, 1260))/100))
    
glycated_hemoglobin_test = glycated_hemoglobin_test_normal + glycated_hemoglobin_test_diabetes
fasting_blood_sugar_test = fasting_blood_sugar_test_normal + fasting_blood_sugar_test_diabetes

diabetes_dummy = ([0] * 500) + ([1]*500)
d = {'Glycated Hemoglobin':glycated_hemoglobin_test,
     'Fasting Blood Sugar':fasting_blood_sugar_test,
     'Diabetes': diabetes_dummy}
df = pd.DataFrame(d)
df.head()

Armed with this dataset, she could fit the following decision tree classification model, specifying that the tree grow only two layers deep so that she can clearly communicate the diagnosis to the patient.

In [4]:
# Training
Y = df['Diabetes']
X = df[['Fasting Blood Sugar', 'Glycated Hemoglobin']]
clf = tree.DecisionTreeClassifier(max_depth = 2)
clf = clf.fit(X,Y)

# Visualize the Tree
dot_data = tree.export_graphviz(clf, feature_names = ['Fasting Blood Sugar', 'Glycated Hemoglobin'], 
                                label = 'root', 
                                leaves_parallel = True, 
                                out_file = None, 
                                impurity = False, 
                                filled=True, 
                                rounded=True, 
                                rotate = False, 
                                class_names = True)
graph = graphviz.Source(dot_data)  
graph

The color of each leaf node corresponds to the majority class within that node; here the leaf nodes colored orange contain a majority of the non-diabetes class and those colored blue contain a majority of the diabetes class. Suppose the patient's test results indicated a fasting blood sugar level of 10.02 and a glycated hemoglobin value of 4.3. This simple decision tree would predict that the patient does not have diabetes.

Although there are other machine learning algorithms that predict with superior accuracy, the true strength of regression trees lies in their visual nature. We will demonstrate another example of this algorithm later on by predicting the selling price of single family homes using variables such as square footage and number of bedrooms. Here is that example, as well as two others:

**Examples:**

Say we want to predict selling prices of single family homes *(a continuous variable)*. Regression trees can predict this by examining:
* Is increased square footage related to prices for single family homes? *(continuous)*
* How much is the style of home related to the selling price of single family homes? *(categorical)*
* Zip code/county/state/etc. *(categorical)*
* Median income of neighborhood/zip code (if area variable is larger than zip code) *(continuous)*

Health (Type II Diabetes) *(categorical)*
* Is increasing sugar consumption (avg. grams per day) related to whether or not you develop type II diabetes? *(continuous)*
* How does increasing weight relate to developing type II diabetes? *(continuous)*
* Number of days per week with greater than 30 minutes of exercise *(categorical)*
* Age *(continuous)*
* Parent has diabetes *(categorical)* 
* Hours worked/week *(continuous)*

Election outcomes (voter-share) *(continuous)*
* What State/Region tends to have greater voter-share? *(categorical)*
* How much is campaign spending related to voter-share? *(continuous)*
* Incumbent *(categorical)*
* Political Party *(categorical)*
* GDP Growth *(continuous)*
* General vs. Midterm Election *(categorical)*

## Theory

Prediction trees are a particular kind of nonlinear predictive model. There are two varieties: regression trees and classification trees. We will be focused on regression trees. Using linear regressions, we are able to make quantitative predictions. However, linear regressions do not do well when the underlying relationship is nonlinear. A solution to this problem can be to partition the data into smaller regions that have more manageable linear interactions. We can recursively subdivide the partitions until we get extremely manageable pieces that can be estimated with simple regression models. This process is known as recursive partitioning. Hence, we use recursive partitioning to sort the data into small, manageable sections and then use a simple model for each part of the partition.

A regression tree is a representation of the recursive partitioning process. The basic idea behind regression trees is that each good factor (variable in ML) can be used to make a "decision" about the likelihood of an outcome. Each split is called a _node_. The following diagram gives an example:

<img src="img1.png">


As you can see, the starting point of a tree is called a root node. From there, different branches take us to intermediate nodes called internal nodes, or child nodes. Branches would continue to connect us to intermediate nodes until we reach the end. The last nodes are called leaf nodes, or terminal nodes. Thus, each leaf node of the regression tree represents a part of the partition that has an estimate found using a simple model. The estimate at the leaf nodes applies only to the specific partition. 

We navigate the tree by asking a sequence of questions about specific features for some observation, $x$. Each question usually refers to only a single attribute and has a yes or no answer. For example, a question of this type could concern the gender of the observation (i.e., is the observation male or not). The variables can be either continuous or discrete (but ordered).

For classic regression trees, the model in each node is a constant estimate of $Y$. That is, suppose the points $$(x_1,y_1), (x_2,y_2), …, (x_c,y_c)$$ are all the observations belonging to the node, $z$. Then our model for $z$ is: $$\hat{y}=\frac{1}{c} \sum_{i=1}^{c}y_i$$ This is the sample mean of the dependent variable in that node. This is a piecewise-constant model.					
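
As a minimal sketch of this piecewise-constant rule, the snippet below computes the prediction for a single node and checks that moving away from the sample mean increases the sum of squared residuals. The values in `y_node` and the helper names are made up for illustration.

```python
# Hypothetical target values for the observations that fall in one node z
y_node = [3.0, 5.0, 4.0, 8.0]

def node_prediction(y):
    """Constant prediction for a node: the sample mean of y."""
    return sum(y) / len(y)

def sum_squared_residuals(y, prediction):
    """Sum of squared residuals when every point gets the same prediction."""
    return sum((yi - prediction) ** 2 for yi in y)

y_hat = node_prediction(y_node)
ssr_at_mean = sum_squared_residuals(y_node, y_hat)
ssr_shifted = sum_squared_residuals(y_node, y_hat + 1)

print(y_hat)        # 5.0
print(ssr_at_mean)  # 14.0
print(ssr_shifted)  # 18.0
```

Any constant other than the mean gives a larger sum of squared residuals, which is why the mean is used as the node's estimate.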

One of the problems with recursive partitioning is that we need to balance the informativeness of the partitions with parsimony. Taken to the extreme, we could end up putting every point in its own leaf node, which would not be very useful. A typical stopping criterion is to stop growing the tree when further splits give less than some minimal amount of extra information, or when they would result in nodes containing less than a small percentage of the total data.

Regression trees can be used to address problems in which we want to predict the value of a continuous variable from a set of continuous and/or categorical variables. Further, if we have enough data, we can split the data into a training and test set, allowing us to predict outcomes given new (similar) data.



### The Algorithm

The goal of the regression tree model is to make the best prediction possible. Concretely, we grow the tree to minimize the sum of squared residuals for a given tree $T$.
The sum of squared residuals for a tree $T$ is $$S=\sum_{c\in \text{leaves}(T)}\sum_{i\in c}(y_i-m_c)^2$$ where $$m_c=\frac{1}{n_c}\sum_{i\in c}y_i$$ is the prediction for leaf $c$ and $n_c$ is the number of points in leaf $c$. We make our splits to minimize $S$, subject to specified hyperparameters $q$ (the minimum number of points allowed in each leaf) and $\delta$ (a lower bound for the largest decrease in $S$).

The Algorithm:
1. Start with a single node containing all points. Calculate $m_c$ and $S$. 
2. If all the points in the node have the same value for all the independent variables, stop. Otherwise, search over all binary splits of all variables for the one which will reduce $S$ as much as possible. If the largest decrease in $S$ would be less than our threshold $\delta$, or one of the resulting nodes would contain fewer than $q$ points, stop. Otherwise, take that split, creating two new nodes.
3. In each new node, go back to step 1. 
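
The search in step 2 can be sketched for a single feature as follows. This is an illustrative sketch, not the implementation used by any library: the function names, the toy data, and the choice to try every observed value as a threshold are assumptions. The $\delta$ check is left to the caller, who can compare `sse(y)` with the returned value.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(x, y, q=1):
    """Search all thresholds on one feature x for the split that minimizes S.

    Returns (threshold, S_after_split), or (None, S_before) if no split
    improves S while leaving at least q points on each side.
    """
    best_threshold, best_s = None, sse(y)
    for threshold in sorted(set(x)):
        left = [yi for xi, yi in zip(x, y) if xi < threshold]
        right = [yi for xi, yi in zip(x, y) if xi >= threshold]
        # enforce the minimum leaf size q
        if len(left) < q or len(right) < q:
            continue
        s = sse(left) + sse(right)
        if s < best_s:
            best_threshold, best_s = threshold, s
    return best_threshold, best_s

# Toy data with an obvious break between x = 3 and x = 10
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, s_after = best_split(x, y)
print(threshold, s_after)  # 10 4.0
```

A full tree would repeat this search over every feature and then recurse into the two resulting nodes.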

This greedy procedure builds a tree by choosing, at each step, the split that most reduces the squared error within the leaves. However, this algorithm alone often leads to overfitting, which is a major concern with regression trees: you might get great scores within your training set, but then find that the model generalizes poorly. As is often the case in machine learning, there is a tradeoff between bias and variance. The shallower the tree, the greater the bias but the lower the variance, and accepting some bias may be preferable to overfitting.

There are two potential solutions to this problem.

The first is called **pruning**: we grow the largest tree possible before “pruning” it down. To do this, we randomly divide our data into a training set and a testing set (say, 50% training and 50% testing). We then apply the basic tree-growing algorithm to the training data only, with $q = 1$ and $\delta = 0$ to grow the largest tree we can. At this point, there is a big overfitting problem, so we prune the tree: at each pair of terminal nodes with a common parent, we evaluate the error on the testing data, and see whether the sum of squares would be smaller by removing those two nodes and making their parent a terminal node. This process is repeated until pruning can no longer improve the error on the testing data.
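
The sketch below illustrates the grow-then-prune idea with scikit-learn. Note the substitution: rather than the pairwise pruning procedure described above, it uses scikit-learn's minimal cost-complexity pruning (the `ccp_alpha` parameter), a related but different technique, and picks the pruning strength with the lowest error on the held-out half. The synthetic sine data is an assumption for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

# 50% training / 50% testing, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Grow the largest tree, then get the sequence of candidate pruning strengths
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
full_mse = mean_squared_error(y_test, full_tree.predict(X_test))
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
# Clip to guard against tiny negative values from floating-point error
alphas = np.unique(np.clip(path.ccp_alphas, 0, None))

# Refit one pruned tree per alpha and keep the one with the lowest test error
best_tree, best_mse = None, np.inf
for alpha in alphas:
    candidate = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    candidate.fit(X_train, y_train)
    mse = mean_squared_error(y_test, candidate.predict(X_test))
    if mse < best_mse:
        best_tree, best_mse = candidate, mse

print("Leaves before pruning:", full_tree.get_n_leaves())
print("Leaves after pruning: ", best_tree.get_n_leaves())
```

Because the unpruned tree (`ccp_alpha = 0`) is one of the candidates, the selected tree can never do worse on the held-out data than the full tree.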

The second potential solution is to use **cross-validation** to choose the hyperparameters for the model. The most common hyperparameters you can specify in a regression tree include: 

- max_depth (maximum number of levels of branches permitted)
- max_features (maximum number of features considered when deciding each split)
- min_samples_split (minimum number of observations a node must contain to be split)
- min_samples_leaf (minimum number of observations at each leaf)
- min_weight_fraction_leaf (minimum weighted fraction of all observations per leaf node)
- max_leaf_nodes (maximum number of leaf nodes)

While both methods are effective, for the technical examples in this paper, we choose to use the cross-validation technique.
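
As a rough sketch of the cross-validation approach, the snippet below searches over two of the hyperparameters listed above with `GridSearchCV`. The synthetic data and the particular grid values are assumptions chosen for illustration; they are not the settings used later in this paper.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

# Candidate hyperparameter values (illustrative only)
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}

# 5-fold cross-validation, scored by (negative) mean squared error
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated MSE: ", -search.best_score_)
```

scikit-learn maximizes scores, which is why the error is negated in `scoring` and flipped back when printing.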

In [None]:
# Source: https://clearpredictions.com/Home/DecisionTree
from IPython.display import Image
Image('tree-infographic.png')

### Key Concept: Gini Impurity

For now, we will start with the simplest case: a classification problem with two outcomes. A common example uses a dataset of passengers on the Titanic to predict who survives.

Ideally, we want factors that are as predictive as possible. If men and women are equally likely to survive, the variable can't tell us much (barring interaction with other variables). Fortunately (depending on your perspective, but at least for prediction purposes), it turns out women are more likely than men to survive, so _sex_ will be an important factor in our tree.

That means that a node splitting on _sex_ has relatively low Gini impurity. Gini impurity measures how often a randomly selected element would be mislabeled if it were labeled at random according to the distribution of labels in the subset. A factor with low impurity is very predictive of the outcome. Conversely, the impurity of a node would be maximized if equal proportions of its values (males and females here) survived.

Gini impurity is formally defined as:

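$$Gini_{i} = 1 - \sum_{k=1}^{n}{p_{i,k}^2}$$

where $p_{i,k}$ is the proportion of elements in node $i$ that belong to class $k$, and $n$ is the number of classes.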

For example, if it were the case that 70% of the survivors were female, the Gini impurity of the _sex_ node would be:

$$1-0.3^2-0.7^2 = 0.42$$

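Notice that 0.42 is the probability of mislabeling a randomly chosen element when labels are assigned at random according to this 30/70 distribution.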

Now suppose that 50% of the survivors were male. The Gini impurity of the _sex_ node would be:

$$1-0.5^2-0.5^2 = 0.5$$

Notice here that the Gini impurity increased to 0.5, its maximum for two classes, with a 50/50 split.

Finally, suppose that only 10% of the survivors were male:

$$1-0.9^2-0.1^2 = 0.18$$

Here, the highly disproportionate split makes the Gini index very low, indicating very low impurity.
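
The three calculations above can be reproduced with a few lines of Python. The function name `gini_impurity` is our own choice for this sketch; it simply applies the formula to a list of class proportions.

```python
def gini_impurity(proportions):
    """Gini impurity of a node given the proportion of each class in it."""
    return 1.0 - sum(p ** 2 for p in proportions)

# The three cases from the text, plus a pure node for comparison
examples = {
    "70/30 split": [0.3, 0.7],
    "50/50 split": [0.5, 0.5],
    "90/10 split": [0.9, 0.1],
    "pure node": [1.0, 0.0],
}

gini_values = {name: gini_impurity(p) for name, p in examples.items()}
for name, value in gini_values.items():
    print(f"{name}: {value:.2f}")
```

The printed values are 0.42, 0.50, 0.18, and 0.00, matching the hand calculations and showing that a pure node has zero impurity.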

In general, it makes sense to grow a tree **greedily**: start with the lowest-impurity feature split, then move to the next lowest.

If the dataset has two classes and 50% of the dataset belongs to one class and 50% to the other, the classes are evenly mixed and the Gini index is at its maximum. Conversely, if the dataset has two classes and all of the instances belong to a single class, the Gini index is at its minimum, as shown in the image below.

In [None]:
# Source: http://queirozf.com/entries/evaluation-metrics-for-classification-quick-examples-references
from IPython.display import Image
Image("https://i.imgur.com/DBxpMwl.png") 

## Under the Hood: Step-by-Step Tree

Now that we have seen a regression tree in action, we can step back and look at what is actually happening when we run a regression tree. Note that the functions below build a classification tree, using the Gini index to choose splits.

Here are the functions we will create and what they will do:
    
* test_split() -  Split a dataset based on an attribute and an attribute value
* gini_index() -  Calculate the Gini index for a split dataset
* get_split() -   Select the best split point for a dataset
* to_terminal() - Create a terminal node value
* split() -       Create child splits for a node or make terminal
* build_tree() -  Build a decision tree
* print_tree() -  Print a decision tree
* predict() -     Make a prediction with a decision tree

In [None]:
def test_split(index, value, dataset):
    '''
    split a dataset based on an attribute and an attribute value
    '''
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right

In [None]:
def gini_index(groups, classes):
    '''
    calculate the gini index for some split
    '''
    
    # count all samples at split point
    n_instances = float(sum([len(group) for group in groups]))
    # sum weighted Gini index for each group
    gini = 0.0
    for group in groups:
        size = float(len(group))
        # avoid divide by zero
        if size == 0:
            continue
        score = 0.0
        # score the group based on the score for each class
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p
        # weight the group score by its relative size
        gini += (1.0 - score) * (size / n_instances)
    return gini

In [None]:
def get_split(dataset):
    '''
    select the best split point for a dataset
    '''
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0])-1):
        for row in dataset:
            # evaluate every observed value of every feature as a candidate split
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            print('X%d < %.3f  =>  Gini = %.3f' % ((index+1), row[index], gini))
            # keep the candidate with the lowest weighted Gini index
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index':b_index, 'value':b_value, 'groups':b_groups}

In [None]:
def to_terminal(group):
    '''
    create a terminal node value
    '''
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)

In [None]:
def split(node, max_depth, min_size, depth):
    '''
    create child splits for a node or make terminal
    '''
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth+1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth+1) 
    return node

In [None]:
def build_tree(train, max_depth, min_size):
    '''
    takes training data and two hyperparameters (max_depth, min_size) and builds a decision tree
    '''
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

In [None]:
def print_tree(node, depth=0):
    '''
    print a decision tree
    '''
    if isinstance(node, dict):
        print('%s[X%d < %.3f]' % ((depth*' ', (node['index']+1), node['value'])))
        print_tree(node['left'], depth+1)
        print_tree(node['right'], depth+1)
    else:
        print('%s[%s]' % ((depth*' ', node)))

In [None]:
def predict(node, row):
    '''
    takes a tree and uses it to predict an outcome for a given set of factors
    '''
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

Now we can try it out on a dataset of our own making: 

In [None]:
dataset = [
          [ 2.77, 1.78, 0],
          [ 1.72, 1.16, 0],
          [ 3.67, 2.81, 0],
          [ 3.96, 2.61, 0],
          [ 2.99, 2.20, 0],
          [ 7.49, 3.16, 1],
          [ 9.03, 3.33, 1],
          [ 7.44, 0.47, 1],
          [10.12, 3.23, 1],
          [ 6.64, 3.31, 1]
          ]

In [None]:
split_data = get_split(dataset)
print('Split: [X%d < %.3f]' % ((split_data['index']+1), split_data['value']))

In [None]:
tree = build_tree(dataset, 1, 1)
print_tree(tree)

Now we can test out the tree and see how it does.

In [None]:
for row in dataset:
    prediction = predict(tree, row)
    print('Predicted = %d,   Actual = %d' % (prediction,row[-1]))

It performs perfectly on our contrived data! What a surprise. 

The key takeaway is how it chose the better split of the two possible variables to split on, and stopped because there was a value in X1 that split the entire dataset accurately.

### Pros and Cons of Regression Trees

Advantages:				
1. Making predictions is fast, since the calculation process is not complicated (computationally efficient).
2. It’s easy to understand what variables are important in prediction (look at the tree). They are among the easiest to visualize of ML models. They are intuitive and not hard to explain, even to someone with little econometrics training.
3. If some data is missing, we might not be able to go all the way down the tree to a leaf, but we can still make a prediction by averaging all the leaves in the subtree we do reach. Further, they don't have the same problems with non-numerical or categorical data and collinearity.
4. There are fast, reliable algorithms to learn these trees.

Disadvantages:
On the downside, they often don't have the highest accuracy in prediction and can be sensitive to minor changes in data. One way to overcome these weaknesses is to use multiple decision trees aggregated (random forests, boosting) or in conjunction with other models (stacking).
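
To make the comparison with aggregated trees concrete, the sketch below cross-validates a single regression tree and a random forest on the same data. The synthetic dataset and model settings are assumptions for illustration, and no particular result is implied for other datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

models = {
    "single tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5-fold cross-validated mean squared error for each model
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    results[name] = -scores.mean()
    print(f"{name}: cross-validated MSE = {results[name]:.3f}")
```

The forest gives up the single tree's easy visualization, so the choice between them depends on whether interpretability or accuracy matters more.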

## Example
#### Housing Prices: A Kaggle Dataset

Here’s an example using a regression tree. Suppose you are interested in predicting home prices based on home characteristics. This could be because you are constructing, buying, or selling a home, and are looking for a ballpark price range based on home characteristics. Or perhaps you are investing in real estate and would like data to decide whether a home is priced above or below average given its specific features. Alternatively, you could be interested in home value appraisal for taxation purposes. Your fundamental question: given individual housing characteristics, how much will this home sell for?


Here, we have a dataset from Kaggle that includes all the relevant information to answer this question. The dataset has detailed information on a large number of housing characteristics and the sale price. Now, we can use a regression tree to predict future home sale values. The following notebook demonstrates how to construct such a regression tree.



The training dataset contains 1460 observations and 80 features. Let's start by calling packages needed for our analysis.

In [3]:
# Core Packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# ML Packages
from sklearn.linear_model import SGDRegressor, ElasticNetCV
from sklearn.metrics import mean_squared_error, make_scorer, f1_score, classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split, learning_curve, RandomizedSearchCV, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Other Packages
import graphviz 
from sklearn import tree

After importing the necessary packages, we load the training and test datasets.

In [None]:
train_location = "train.csv"
test_location = "test.csv"

train = pd.read_csv(train_location)
test = pd.read_csv(test_location)

To better understand our data format, we look at the first few rows of the training data.

In [None]:
train.head()

We'll remove 'SalePrice' and 'Id' from the training dataset and log-transform 'SalePrice', which is our target variable of interest. 

In [None]:
target = train['SalePrice']
target_transformed = np.log(target)

train = train.drop(['SalePrice', 'Id'], axis = 1)
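
Because the model is fit to the log of 'SalePrice', its predictions are on the log scale and need to be exponentiated to get back to dollars. The short sketch below shows the round trip on a few made-up prices (the values are hypothetical, not taken from the dataset).

```python
import numpy as np

# Hypothetical sale prices in dollars
sale_price = np.array([100000.0, 200000.0, 755000.0])

# Log-transform, as done for the target above
log_price = np.log(sale_price)

# Invert the transform to return to the dollar scale
recovered = np.exp(log_price)

print(log_price)
print(recovered)
```

The log transform compresses the long right tail of prices, so errors on expensive homes do not dominate the squared-error criterion.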

### 1. Data Cleaning 

Before we start cleaning, let's develop a better understanding of what the data looks like. It looks like we have information about almost every aspect of a home (and its surrounding property) you could imagine, from commonly cited measures like square-feet and number of bedrooms to more detailed  metrics like the height of the basement or the masonry veneer type. Note that the final column is 'SalePrice', which is the variable we seek to predict. 

Below is a categorization of the features within the following categories: Sale, General, Location, Property, Interior, Basement, Utilities, Garage, and Exterior. This categorization is a subjective exercise, but it allowed us to become more familiar with the features and create general buckets within the dataset.

**Sale**
- SalePrice: the property's sale price in dollars
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale

**General**
- MSSubClass: The building class
- MSZoning: The general zoning classification
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar Value of miscellaneous feature

**Location**
- Street: Type of road access
- Alley: Type of alley access
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- LotFrontage: Linear feet of street connected to property

**Property**
- LotArea: Lot size in square feet
- LotShape: General shape of property
- LandContour: Flatness of the property
- LotConfig: Lot configuration
- LandSlope: Slope of property

**Interior**
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- BedroomAbvGr: Number of bedrooms above basement level
- KitchenAbvGr: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality

**Basement**
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms

**Utilities**
- Utilities: Type of utilities available
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system

**Garage**
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition

**Exterior**
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality

Note that these features are a mix of continuous (Lot Area, Year Built, Bedrooms) and categorical (House Style, Roof Style, Garage Type) variables. 

Let's start cleaning by checking for missing values. Below we find the number of missing values for each feature, for features with missing values. 

In [None]:
# Find the number of missing values for each feature, including only those greater than 0. 
missing_values = pd.DataFrame(train.isnull().sum())
missing_values = missing_values[(missing_values > 0).any(axis=1)]

# Sort the values in descending order.
missing_values = missing_values.sort_values(by = 0, ascending = False)
missing_values.columns = ['Number of Missing Values']

# Calculate 'Percent Missing'
missing_values['Percent Missing'] = missing_values['Number of Missing Values']/len(train)
missing_values

19 of the 80 features are missing one or more values. However, the degree to which values are missing varies widely across the 19 variables. Only 7 of the 1460 properties have information about pool quality ('PoolQC') while only 1 property is missing information about the property's electrical system.

We'll drop 'Alley', 'FireplaceQu', 'PoolQC', 'PoolArea', 'Fence', and 'MiscFeature' from our dataset, since most observations do not have information for those variables.

In [None]:
train = train.drop(['MiscFeature', 'Fence', 'PoolQC', 'PoolArea', 'FireplaceQu', 'Alley'], axis = 1)

What about the others? Let's fill them in with the average of the feature if the feature is continuous or with the mode if the feature is categorical. 

In [None]:
for feature in train.columns:
    if train[feature].dtype == 'O':
        # Features with a 'dtype' of O are categorical: fill with the mode
        train[feature] = train[feature].fillna(train[feature].mode()[0])
    elif pd.api.types.is_numeric_dtype(train[feature]):
        # Integer and float features are continuous: fill with the mean
        train[feature] = train[feature].fillna(train[feature].mean())

Let's confirm there aren't any remaining missing values.

In [None]:
# Should return 'False'
train.isnull().any().any()

We next look at outliers. To start, we'll explicitly determine which of our features are categorical and which are continuous.

In [None]:
# Create two empty lists
continuous_features = []
categorical_features = []

# Separate features by dtype
for feature in train.columns:
    if train[feature].dtype == "object":
        categorical_features.append(feature)
    else:
        continuous_features.append(feature)
        
print("Number of Continuous Features:", len(continuous_features), "\nNumber of Categorical Features:", len(categorical_features))

We'll use this to filter outliers according to a simple rule: 

For each column we compute the z-score of each value in the column relative to the column mean and standard deviation. Since the direction of the difference is irrelevant, we take the absolute value. Here we remove rows that contain a (continuous) feature value more than 5 standard deviations away from the column mean.

The code below was adapted from [this](https://stackoverflow.com/questions/23199796/detect-and-exclude-outliers-in-pandas-dataframe) Stack Overflow question.

In [None]:
n_std = 5
len(train) - len(train[train[continuous_features].apply(lambda x: np.abs(x - x.mean()) / x.std() < n_std).all(axis=1)])

In doing so we drop 86 rows of our training data. We can adjust this threshold later to see if it affects our mean squared error.

In [None]:
# Drop rows in training set (and target) according to the rule described above
# Compute the mask once, using the same n_std threshold as the count above
keep = train[continuous_features].apply(lambda x: np.abs(x - x.mean()) / x.std() < n_std).all(axis=1)
target_transformed = target_transformed[keep]
train = train[keep]

The final step of the cleaning process is to create dummy variables for the categorical features. 

In [None]:
train_no_dummies = train
train = pd.get_dummies(train)

We apply the same cleaning steps to the test data: dropping sparse features, filling in missing values, and creating dummies. We leave outliers in the test data.

In [None]:
test = test.drop(['MiscFeature', 'Fence', 'PoolQC', 'FireplaceQu', 'Alley'], axis=1)

for feature in test:
    if test[feature].dtype == 'O':
        # Categorical features (dtype 'O'): fill missing values with the most common category
        test[feature] = test[feature].fillna(test[feature].mode()[0])
    elif pd.api.types.is_numeric_dtype(test[feature]):
        # Continuous features (ints and floats): fill missing values with the column mean
        test[feature] = test[feature].fillna(test[feature].mean())

# Keep a copy of the pre-dummy data, then create the dummy variables
test_no_dummies = test.copy()
test = pd.get_dummies(test)

# Only keep columns in test that are also found in train; dummies missing from test are filled with 0.
# This must come after get_dummies so that the dummy column names can match.
test = test.reindex(columns=train.columns, fill_value=0)
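The column alignment between training and test dummies is easy to get wrong, so here is a small sketch of the idea on made-up data. The frames and category names are hypothetical; the point is that a category seen only in training becomes an all-zero column in test, and a category seen only in test is dropped.

```python
import pandas as pd

# Toy data: "green" and "blue" appear only in train, "purple" only in test
toy_train = pd.DataFrame({"size": [1.0, 2.0, 3.0], "color": ["red", "blue", "green"]})
toy_test = pd.DataFrame({"size": [4.0, 5.0], "color": ["red", "purple"]})

# Create dummies first, so that the dummy column names exist in both frames
train_d = pd.get_dummies(toy_train, columns=["color"], dtype=int)
test_d = pd.get_dummies(toy_test, columns=["color"], dtype=int)

# Align test to train: add missing dummies as 0, drop dummies unseen in train
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(test_d)
```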

### 2. Data Exploration & Visualization 

With the data cleaned, we're now ready to explore it. We begin by calculating the correlation of each continuous feature with the target and sorting the results from most negative to most positive.

In [None]:
values = []

# Filter out categorical variables
df = train[continuous_features]

# Iterate over each continuous feature and calculate its correlation with the target
for feature in df.columns:
    values.append([feature, df[feature].corr(target_transformed)])
    
# Sort the values and present them in a Pandas Dataframe
values = sorted(values, key=lambda x: x[1])
correlations = pd.DataFrame(values, columns = ['Feature', 'Correlation with SalePrice'])
correlations.tail()
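The same ranking can be obtained without an explicit loop by using `DataFrame.corrwith`. The sketch below runs on made-up numbers (the feature names `x1`, `x2`, `x3` are placeholders) and is meant only to show the shape of the computation.

```python
import pandas as pd

toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],     # rises with the target
    "x2": [5, 3, 4, 1, 2],     # falls as the target rises
    "x3": [1, -1, 0, -1, 1],   # nearly unrelated to the target
})
toy_target = pd.Series([2.0, 4.0, 6.0, 8.0, 10.5])

# Pearson correlation of every column with the target, sorted ascending
ranked = toy.corrwith(toy_target).sort_values()

print(ranked)
```

As with the loop, `ranked.tail()` would then show the most positively correlated features.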

It looks like 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea', 'TotalBsmtSF', and '1stFlrSF' are moderately to highly correlated with 'SalePrice'. Since a regression tree has no coefficients, we'll instead want to see how prominently these features appear in the splits when we do model fitting later on.

### 3. Model Fitting & Evaluation

In our modeling, we will use the techniques described in the analytical framework to fit a regression tree that predicts the log of house prices for homes in the Kaggle dataset. We will estimate both a full regression tree that includes all available variables and a simpler regression tree that gives a clearer visual and conceptual picture of how regression trees work. 

#### Small Regression Tree

We first estimate a tree with a subset of the variables to give a concise, easy-to-understand example of how regression trees work. In particular, we use a subset of variables that are likely to be the most salient for homebuyers when purchasing a home: total square footage, overall quality, overall condition, lot size, the year the home was built, and the number of bedrooms and bathrooms. We subset both the training data and the test data by these variables.

In [None]:
# Prepare the training data (work on a copy so the pre-dummy frame is left unchanged)
s_train = train_no_dummies.copy()
s_train['TotalSF'] = s_train['TotalBsmtSF'] + s_train['1stFlrSF'] + s_train['2ndFlrSF']
s_train = s_train[['TotalSF', 'OverallQual', 'OverallCond', 'LotArea', 'YearBuilt', 'BedroomAbvGr', 'FullBath', 'HalfBath']]

# Prepare the test data in the same way
s_test = test_no_dummies.copy()
s_test['TotalSF'] = s_test['TotalBsmtSF'] + s_test['1stFlrSF'] + s_test['2ndFlrSF']
s_test = s_test[['TotalSF', 'OverallQual', 'OverallCond', 'LotArea', 'YearBuilt', 'BedroomAbvGr', 'FullBath', 'HalfBath']]

Next, we scale the training data and the test data to prepare them for our analysis.

In [None]:
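from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then scale the training data
scaler = StandardScaler()
scaler.fit(s_train)
scaled_s_train_df = scaler.transform(s_train)

# Scale the test data with the mean and standard deviation learned from the training data
scaled_s_test_df = scaler.transform(s_test)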
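A common convention is to learn the scaling statistics from the training rows only and reuse them for the test rows, so that both sets are expressed in the same units. The sketch below illustrates this on made-up arrays; the array names are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0], [10.0, 5.0]])

# The mean and standard deviation come from the training rows only
scaler = StandardScaler().fit(toy_train)
scaled_train = scaler.transform(toy_train)

# The same shift and scale are applied to the test rows
scaled_test = scaler.transform(toy_test)

print(scaled_train.mean(axis=0), scaled_train.std(axis=0))
print(scaled_test)
```

Tree splits depend only on the ordering of each feature's values, so scaling matters less for regression trees than for distance-based or penalized models; it is kept here for consistency with the rest of the workflow.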

We hold out a third of the training data for evaluation, fit a regression tree with a maximum depth of 3 to the remaining rows, and generate predictions for both the training split and the held-out split.

In [None]:
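from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hold out a third of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(s_train, target_transformed, test_size=0.33, random_state=42)

# Fit a shallow regression tree and predict on both splits
clf = DecisionTreeRegressor(max_depth=3)
clf = clf.fit(X_train, y_train)
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)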
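To see what a depth limit of 3 implies, the following sketch fits the same kind of tree to synthetic data and scores it with mean squared error. The data-generating process is invented for illustration: the target jumps when the first feature crosses 5, plus a little noise.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a step in the first feature plus small noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 3.0, 1.0) + 0.1 * rng.normal(size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=42)
toy_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

# A tree of depth 3 has at most 2**3 = 8 leaves, so it predicts at most 8 distinct values
print("leaves:", toy_tree.get_n_leaves())
print("train MSE:", mean_squared_error(y_tr, toy_tree.predict(X_tr)))
print("held-out MSE:", mean_squared_error(y_te, toy_tree.predict(X_te)))
```

Comparing the two errors gives a first indication of overfitting: a large gap between training and held-out MSE suggests the tree is fitting noise.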

We create a visual representation of this simplified regression tree using the "graphviz" package. The visualization shows the leaves and branches of our regression tree model. 

In [None]:
# http://scikit-learn.org/stable/modules/generated/sklearn.tree.export_graphviz.html
dot_data = tree.export_graphviz(
    clf, out_file=None, feature_names=s_train.columns, label='root',
    filled=True, impurity=True, proportion=True, rounded=True,
)
graph = graphviz.Source(dot_data)  
graph
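When Graphviz is not available, scikit-learn's `export_text` offers a plain-text view of the same split rules. The sketch below uses a tiny made-up dataset; the feature names are borrowed from the small tree purely as labels.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Six toy observations with a clean break between the third and fourth
X = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]

toy_tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# Each indented line is a branch; lines containing "value" are leaf predictions
rules = export_text(toy_tree, feature_names=["TotalSF", "LotArea"])
print(rules)
```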

At the end of this section, we also report a portion of our results: the actual sale price and the predicted sale price for the first 20 observations in the training split. 

#### Full Regression Tree

Now using the complete dataset, we again begin by scaling the training and test data.

In [None]:
# Fit the scaler on the training data only, then scale the training data
scaler = StandardScaler()
scaler.fit(train)
scaled_train_df = scaler.transform(train)

# Scale the test data with the mean and standard deviation learned from the training data
scaled_test_df = scaler.transform(test)

We then confirm that the matrices have the correct shape: the training and test sets should have the same number of columns, and the target should have as many rows as the training set.

In [None]:
print(target_transformed.shape, scaled_train_df.shape, scaled_test_df.shape)

We again split the training data into a training split and a held-out split, and we use grid-search cross-validation to choose the hyperparameters of the regression tree. We then fit the model to the training split, generate predictions for both splits, and report the hyperparameters selected by cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(scaled_train_df, target_transformed, test_size=0.33, random_state=42)

# Candidate hyperparameters; every combination is scored by cross-validated (negative) MSE
param_dist = {"min_samples_leaf": [3, 5, 8], "max_depth": [15, 20, 25, 30]}
model = DecisionTreeRegressor()
dt = GridSearchCV(model, param_grid=param_dist, scoring='neg_mean_squared_error')

# Fit on the training split; predictions come from the best setting refit on that split
dt.fit(X_train, y_train)
dt_train_predictions = dt.predict(X_train)
dt_test_predictions = dt.predict(X_test)
print("Best Params: {}".format(dt.best_params_))
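The grid search can be tried in isolation on synthetic data. In the sketch below the data, the smaller grid, and the explicit `cv=5` are assumptions made for illustration; the scoring rule is the same negative mean squared error used above, so the best score is negated to read it as an MSE.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a smooth signal in the first feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=300)

param_grid = {"min_samples_leaf": [3, 5, 8], "max_depth": [2, 5, 10]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # higher is better, hence the sign flip
    cv=5,
)
search.fit(X, y)

print("Best Params:", search.best_params_)
print("CV MSE of best setting:", -search.best_score_)
```

All nine combinations are evaluated; `search.cv_results_` holds the per-combination scores for comparing settings beyond the single best one.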

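In [None]:
# X_train now holds the full feature set, which the small tree (clf) was not trained on,
# so we use the predictions stored earlier. Both splits use the same random_state and
# the same rows, so the stored small-tree predictions line up with y_train.
results = pd.DataFrame({
    "Actual": np.exp(y_train),
    "Predicted (small tree)": np.exp(train_predictions),
    "Predicted (full tree)": np.exp(dt_train_predictions),
})
results = results.reset_index(drop=True)
results.head(20)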

## References

Reference 1: Page 325 describes regression trees.
     http://web.b.ebscohost.com/ehost/pdfviewer/pdfviewer?vid=1&sid=76d6816d-9bab-4ef7-b763-0e3a8872b6fe%40sessionmgr103

Link 2: http://web.b.ebscohost.com/ehost/detail/detail?vid=0&sid=76d6816d-9bab-4ef7-b763-0e3a8872b6fe%40sessionmgr103&bdata=JnNpdGU9ZWhvc3QtbGl2ZSZzY29wZT1zaXRl#AN=2009-22665-002&db=pdh

Strobl, C.; Malley, J.; Tutz, G. (2009). "An Introduction to Recursive Partitioning: Rationale, Application and Characteristics of Classification and Regression Trees, Bagging and Random Forests". Psychological Methods. 14 (4): 323–348. doi:10.1037/a0016973.

Reference 2: Chipman, Hugh A., Edward I. George, and Robert E. McCulloch. "Bayesian CART model search." Journal of the American Statistical Association 93.443 (1998): 935-948. https://search.proquest.com/docview/274825524?pq-origsite=gscholar

Reference 3: “Decision Tree Learning.” Wikipedia, Wikimedia Foundation, 12 Apr. 2018, en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity.