## Scenario

Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

You ask your cousin how he's predicted real estate values in the past, and he says it is just intuition. But more questioning reveals that he's identified price patterns from houses he has seen in the past, and he uses those patterns to make predictions for new houses he is considering.

Machine learning works the same way. We'll start with a model called the Decision Tree. There are fancier models that give more accurate predictions. But decision trees are easy to understand, and they are the basic building block for some of the best models in data science.

In general, any machine learning workflow can be represented as follows:

<img src="../../images/ML-Pipeline.jpg" style="width:400px"/><br><br>

In this tutorial we will show how to train a model, make predictions with it, and validate it!

## What is a Random Forest?

### Based on decision trees

Classification and Regression Trees, or CART for short, is a term introduced by Leo Breiman to refer to Decision Tree algorithms that can be used for classification or regression predictive modeling problems.

The aim at each stage is to associate specific targets (i.e., desired output values) with specific values of a particular variable. The result is a decision tree in which each path identifies a combination of values associated with a particular prediction.

Each non-leaf node in this tree is basically a decision maker. These nodes are called decision nodes. Each node carries out a specific test to determine where to go next. Depending on the outcome, you either go to the left branch or the right branch of this node. We keep doing this until we reach a leaf node. If we are constructing a classifier, each leaf node represents a class. For example, a simple decision tree is shown below.

<img src="../../images/SimpleDT.JPG" style="width:400px"/><br><br>

It divides houses into only two categories. The predicted price for any house under consideration is the historical average price of houses in the same category.

We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called fitting or training the model. The data used to <b>fit</b> the model is called the <b>training data</b>.

After the model has been fit, you can apply it to new data to predict prices of additional homes.
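
As a minimal sketch of fitting and predicting with a two-group tree, assume the split point is already chosen and only the leaf values are learned from data. The bedroom threshold, the prices, and the function names below are hypothetical, chosen only for illustration.

```python
from statistics import mean

# Hypothetical training data: (number of bedrooms, sale price).
training_data = [(1, 150_000), (2, 170_000), (3, 250_000), (4, 270_000)]


def fit_two_group_tree(rows, threshold):
    """Fit a one-split tree: each leaf stores the mean price of its group."""
    small = [price for bedrooms, price in rows if bedrooms <= threshold]
    large = [price for bedrooms, price in rows if bedrooms > threshold]
    return {"threshold": threshold, "small": mean(small), "large": mean(large)}


def predict(tree, bedrooms):
    """Return the stored average price of the group the house falls into."""
    return tree["small"] if bedrooms <= tree["threshold"] else tree["large"]


tree = fit_two_group_tree(training_data, threshold=2)
print(predict(tree, 2))  # average of the two smaller houses
print(predict(tree, 5))  # average of the two larger houses
```

With these made-up numbers the two leaves store 160,000 and 260,000, so a new house is simply assigned the historical average of its group.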

You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the total size of each house's lot might look like this:

<img src="../../images/SimpleDT2.JPG" style="width:400px"/><br><br>



You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.
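
Tracing a house through a deeper tree can be sketched as a loop over nested dictionaries. The tree below is hypothetical: its feature names, thresholds, and prices are illustrative and are not read from the figures above.

```python
# Hypothetical tree: decision nodes test one feature against a threshold,
# leaves hold a predicted price. All numbers are illustrative.
tree = {
    "feature": "bedrooms",
    "threshold": 2,
    "left": {  # bedrooms <= 2
        "feature": "lot_size",
        "threshold": 8_000,
        "left": {"prediction": 150_000},
        "right": {"prediction": 180_000},
    },
    "right": {  # bedrooms > 2
        "feature": "lot_size",
        "threshold": 11_000,
        "left": {"prediction": 230_000},
        "right": {"prediction": 290_000},
    },
}


def predict_price(node, house):
    """Follow the branches matching the house until a leaf is reached."""
    while "prediction" not in node:
        branch = "left" if house[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]


house = {"bedrooms": 3, "lot_size": 12_000}
print(predict_price(tree, house))
```

Here the house has more than two bedrooms and a lot larger than 11,000, so it ends at the bottom-right leaf.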

### Details on splitting

We use the Gini index as the cost function for evaluating splits in the dataset. The lower the Gini score, the better the split, so we look for the split that minimizes it.

![alt text](http://i.imgur.com/IijgHbt.png "Logo Title Text 1")

A split in the dataset involves one input attribute and one value for that attribute. It can be used to divide training patterns into two groups of rows.

A Gini score indicates how good a split is by measuring how mixed the classes are in the two groups the split creates. A perfect separation yields a Gini score of 0. The worst case for a two-class problem is a split that leaves a 50/50 class mix in each group: this scores 0.5 when each group's score is weighted by its share of the rows, or 1.0 when the two groups' scores are simply summed.
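
The calculation can be sketched in a few lines. This sketch assumes the common size-weighted form of the Gini index, in which the worst two-class split scores 0.5 (summing the per-group scores without weighting would give 1.0 instead). The function name and the class labels are hypothetical.

```python
def gini_index(groups, classes):
    """Size-weighted Gini impurity of the groups produced by one split."""
    total = sum(len(group) for group in groups)
    score = 0.0
    for group in groups:
        if not group:
            continue  # an empty group contributes nothing
        impurity = 1.0
        for label in classes:
            proportion = group.count(label) / len(group)
            impurity -= proportion ** 2
        score += impurity * len(group) / total
    return score


classes = ["cheap", "pricey"]
perfect = gini_index([["cheap", "cheap"], ["pricey", "pricey"]], classes)
worst = gini_index([["cheap", "pricey"], ["cheap", "pricey"]], classes)
print(perfect, worst)
```

The first split separates the classes completely and scores 0.0. The second leaves both groups evenly mixed and scores 0.5 under this weighting.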

![alt text](https://image.slidesharecdn.com/decisiontree-151015165353-lva1-app6892/95/classification-using-decision-tree-41-638.jpg?cb=1444928106 "Logo Title Text 1")

We calculate the Gini score for every candidate split, using each row's value of each attribute as a possible split point, and divide the data at the split with the lowest score. We then repeat this process recursively on each of the two resulting groups to grow the binary tree.
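
A rough sketch of the split search follows. It assumes each row is a list whose last item is the class label and uses a size-weighted Gini score. The helper names and the sample rows are hypothetical.

```python
def split_rows(rows, index, value):
    """Divide rows into two groups by comparing one attribute to a value."""
    left = [row for row in rows if row[index] < value]
    right = [row for row in rows if row[index] >= value]
    return left, right


def weighted_gini(groups):
    """Size-weighted Gini impurity; the class label is the last item of a row."""
    total = sum(len(group) for group in groups)
    score = 0.0
    for group in groups:
        if group:
            labels = [row[-1] for row in group]
            impurity = 1.0 - sum((labels.count(c) / len(labels)) ** 2 for c in set(labels))
            score += impurity * len(group) / total
    return score


def best_split(rows):
    """Try every attribute/value pair seen in the rows; keep the lowest score."""
    best = None
    for index in range(len(rows[0]) - 1):
        for row in rows:
            score = weighted_gini(split_rows(rows, index, row[index]))
            if best is None or score < best["score"]:
                best = {"index": index, "value": row[index], "score": score}
    return best


# Hypothetical rows: [bedrooms, lot size, price class].
rows = [[1, 7000, "cheap"], [2, 9000, "cheap"], [3, 8000, "pricey"], [4, 12000, "pricey"]]
best = best_split(rows)
print(best)
```

For these rows the search settles on the bedrooms attribute (index 0) with value 3, which separates the two classes completely. Calling `best_split` again on each resulting group is the recursive step.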

![alt text](https://image.slidesharecdn.com/decisiontree-151015165353-lva1-app6892/95/classification-using-decision-tree-14-638.jpg?cb=1444928106 "Logo Title Text 1")
 
Using decision trees, we can build a random forest.
 
One problem with a single big (deep) decision tree (DT) is that it can overfit. That is, the tree can “memorize” the training set the way a person might memorize an eye chart.

The point of a random forest (RF) is to reduce overfitting. It does this by drawing random subsets of the features, building a smaller (shallower) tree on each subset, and then combining the trees' predictions.

The downside of a random forest is that training can be slow in a single process, but because the trees are built independently of one another, the work can be parallelized.
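
The random feature subsets described above can be sketched as follows. This is only the sampling step, not a full forest. `LotArea` and `YearBuilt` are columns of the housing data used later in this tutorial. The other feature names and the function name are hypothetical.

```python
import random

# LotArea and YearBuilt come from the tutorial's data; the rest are hypothetical.
features = ["LotArea", "YearBuilt", "Bedrooms", "Bathrooms", "GarageSize"]


def feature_subsets(features, n_trees, subset_size, seed=0):
    """Draw one random subset of the features for each tree in the forest."""
    rng = random.Random(seed)  # fixed seed so repeated runs draw the same subsets
    return [rng.sample(features, subset_size) for _ in range(n_trees)]


subsets = feature_subsets(features, n_trees=3, subset_size=2)
for subset in subsets:
    print(subset)
```

Each tree would then be trained using only the features in its own subset, so no single tree sees every feature.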

### Majority Vote
![alt text](https://i.ytimg.com/vi/ajTc5y3OqSQ/hqdefault.jpg "Logo Title Text 1")
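
Combining the trees' outputs can be sketched as below, assuming each tree has already produced its own prediction. For classification the forest takes the most common class. For regression a common choice is to average the values. The function names and sample predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_vote(predictions):
    """Return the class predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def average_prediction(predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(predictions)


vote = majority_vote(["pricey", "cheap", "pricey"])
average = average_prediction([200_000, 220_000, 240_000])
print(vote, average)
```

Two of the three trees say "pricey", so that class wins the vote. In a tie, `Counter.most_common` returns the class that was counted first.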

### Subset of data
![alt text](https://mapr.com/blog/predicting-loan-credit-risk-using-apache-spark-machine-learning-random-forests/assets/blogimages/sparkmlrandomforest.png "Logo Title Text 1")

We can treat the real estate problem as either a regression problem (predicting a numeric price) or a classification problem (predicting a price category).

Ignore the forest part for a moment: even a single tree can do regression. For regression, each leaf holds a numeric prediction value rather than a class. Given an input feature vector, you walk the tree just as you would for a classification problem, and the value in the leaf node you reach is the prediction.


## Evaluating the data

The first step in any machine learning project is to familiarize yourself with the data. You'll use the **Pandas** library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as `pd`. We do this with the following command:


In [None]:
# import pandas and abbreviate it as pd
import pandas as pd

# Path of the file to read
file_path = '../data/train.csv'

# Fill in the line below to read the file into a variable home_data
home_data = ____


The pandas `describe()` method is a powerful tool for exploring the data.

It generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.

It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided.
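
As a small sketch of how `describe()` behaves, the example below builds a three-row DataFrame by hand instead of reading the tutorial's file. The values are a made-up stand-in. Only the column names `LotArea` and `YearBuilt` are taken from the data set used here, and `Street` is a hypothetical text column.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; values are illustrative only.
homes = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250],
        "YearBuilt": [2003, 1976, 2001],
        "Street": ["Pave", "Pave", "Grvl"],
    }
)

summary = homes.describe()  # numeric columns only by default
print(summary)

# Individual statistics can be read from the summary by row and column label.
mean_lot_area = round(summary.loc["mean", "LotArea"])
newest_year = summary.loc["max", "YearBuilt"]
print(mean_lot_area, newest_year)

# include="all" also summarizes object (text) columns such as Street.
print(homes.describe(include="all"))
```

By default the text column is left out of the summary. Passing `include="all"` adds rows such as `unique`, `top`, and `freq` for it.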

In [None]:
# Print summary statistics in next line
_


In [None]:
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = _


# As of today, how old is the newest home (current year - the year in which it was built)?
newest_home_age = _


#Hint: Run the describe command. Lot size is in the column called LotArea. Also look at YearBuilt. Remember to round lot size

# Answers: 
#avg_lot_size = 10517
#newest_home_age = 9

## Keep Going
We have explored some of pandas' capabilities. Let's move on!
**[Data Selection](DataSelection.ipynb)**
