# How models work

To understand how models work, we'll take a widely used dataset and a model that is easy to understand.

We'll use the [Housing Prices dataset](https://www.kaggle.com/c/home-data-for-ml-course/data), which contains 80 features and a final sale price for 1460 houses.

We will start with a model called [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree).

But first, shall we have a look at those features?

In [5]:
# Import pandas
import pandas as pd

# Load the dataset, located in ../data/housing/train.csv
df = pd.read_csv('../data/housing/train.csv')

# print the first 5 records of the dataset
df.head()

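|  | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |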


## Decision Tree

We'll start with the simplest possible decision tree:

![Simple Decision Tree](../data/misc/simple_decision_tree.png)

It divides houses into only two categories. The predicted price for any house under consideration is the historical average price of houses in the same category.

We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called **fitting** or **training** the model. The data used to fit the model is called the **training data**.
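To make the idea of fitting concrete, here is a minimal sketch of a one-split tree written in plain Python. The six houses, the `bedrooms` feature, and the helper names `fit_stump` and `predict` are all assumptions made up for illustration, and choosing the split by lowest squared error is one common criterion, not necessarily the one used later in this course.

```python
# Toy data, invented for illustration: (number of bedrooms, sale price).
houses = [
    (1, 120_000), (2, 150_000), (2, 170_000),
    (3, 210_000), (4, 260_000), (4, 280_000),
]


def mean(values):
    return sum(values) / len(values)


def fit_stump(rows):
    """Fit a one-split tree: keep the threshold with the lowest squared error."""
    best = None
    # Skip the smallest value so that neither group is ever empty.
    thresholds = sorted({x for x, _ in rows})[1:]
    for threshold in thresholds:
        left = [y for x, y in rows if x < threshold]
        right = [y for x, y in rows if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best["error"]:
            best = {"threshold": threshold, "left": left_mean,
                    "right": right_mean, "error": error}
    return best


def predict(stump, bedrooms):
    """Predict the average training price of the group the house falls into."""
    return stump["left"] if bedrooms < stump["threshold"] else stump["right"]


stump = fit_stump(houses)
print(f"Split at {stump['threshold']} bedrooms")
print(f"Predicted price for a 5-bedroom house: {predict(stump, 5)}")
```

Fitting happens in `fit_stump`, which uses the data twice: once to choose where to split, and again to compute the average price in each group. `predict` then applies the fitted model to a house it has not seen.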

We will save the details about how the model is trained for later. After the model has been fit, you can apply it to new data to predict prices of additional homes.

## Improving the Decision Tree
Which of the following two models do you think is better?

![2 decision trees](../data/misc/2_decision_trees.png)

The decision tree on the left (Decision Tree 1) probably makes more sense, because it captures the reality that houses with more bedrooms tend to sell at higher prices than houses with fewer bedrooms. The biggest shortcoming of this model is that it doesn't capture most factors affecting home price, like number of bathrooms, lot size, location, etc.

You can capture more factors using a tree that has more "splits." These are called "deeper" trees. A decision tree that also considers the total size of each house's lot might look like this:

![Complex tree](../data/misc/complex_tree.png)

You predict the price of any house by tracing through the decision tree, always picking the path corresponding to that house's characteristics. The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a **leaf**.
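Tracing a house down to a leaf can be sketched with a nested dictionary standing in for the tree. Every threshold, price, and name below (`tree`, `predict_price`, `bedrooms`, `lot_area`) is hypothetical and chosen only to illustrate the traversal; none of them are read from the figure above.

```python
# A hand-written tree. All thresholds and prices are made up for illustration.
tree = {
    "feature": "bedrooms",
    "threshold": 3,
    "below": {
        "feature": "lot_area",
        "threshold": 8_500,
        "below": {"price": 140_000},
        "at_or_above": {"price": 170_000},
    },
    "at_or_above": {
        "feature": "lot_area",
        "threshold": 11_500,
        "below": {"price": 220_000},
        "at_or_above": {"price": 260_000},
    },
}


def predict_price(node, house):
    """Walk from the root to a leaf, following the house's feature values."""
    while "price" not in node:  # only leaves carry a price
        value = house[node["feature"]]
        branch = "below" if value < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["price"]


house = {"bedrooms": 4, "lot_area": 9_600}
print(predict_price(tree, house))  # 220000
```

The example house has 4 bedrooms, so it takes the `at_or_above` branch at the root. Its lot area of 9,600 is below 11,500, so it ends at the leaf holding 220,000.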

The splits and values at the leaves will be determined by the data, so it's time for you to check out the data you will be working with.

## Exploring the data
The first step in any machine learning project is to familiarize yourself with the data. You'll use the pandas library for this. pandas is the primary tool data scientists use for exploring and manipulating data.

Once you have loaded your data into a variable called `df` (or whatever you want to call it), try to answer the following questions:

**1) What is the average lot size?**

_tip: you can apply statistical functions like `mean`, `min`, etc. to any numerical column in your DataFrame_
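As a quick illustration of the tip, the sketch below applies a few of these functions to a tiny stand-in DataFrame. The variable name `sample` is an assumption, and its values are the `LotArea` and `SalePrice` figures from the five rows printed earlier, so the results describe only those five houses and not the full dataset.

```python
import pandas as pd

# Stand-in for the real dataset: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Statistical functions work on a single numerical column...
print(sample.SalePrice.mean())  # 200700.0
print(sample.SalePrice.min())   # 140000
print(sample.SalePrice.max())   # 250000

# ...and describe() summarizes every numerical column at once.
print(sample.describe())
```

Calling the same methods on `df` gives the statistics for all 1460 houses.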

In [4]:
avg_lot_size = df.LotArea.mean() # your answer here 

print(f'Avg lot size is: {avg_lot_size}')

Avg lot size is: 10516.828082191782


**2) As of today, how old is the newest home (current year minus the year it was built)?**

In [8]:
age_newest_house = (2020 - df.YearBuilt).min() # your answer here

print(f'The newest home is {age_newest_house} years old')

The newest home is 10 years old


The newest house in your data isn't that new. Two potential explanations for this:

1. They haven't built new houses where this data was collected.
2. The data was collected a long time ago. Houses built after the data publication wouldn't show up.

If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What if it is explanation #2?

How could you dig into the data to see which explanation is more plausible?
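One possible way to start, sketched under assumptions: compare the newest build year with the latest sale year. The stand-in DataFrame `sample` uses the dataset's `YearBuilt` and `YrSold` column names; the `YrSold` values match the five rows printed earlier, while the `YearBuilt` values are illustrative only.

```python
import pandas as pd

# Stand-in for the real dataset. The YearBuilt values are illustrative only.
sample = pd.DataFrame({
    "YearBuilt": [2003, 1976, 2001, 1915, 2000],
    "YrSold": [2008, 2007, 2008, 2006, 2008],
})

newest_build = sample.YearBuilt.max()
latest_sale = sample.YrSold.max()
gap = latest_sale - newest_build

print(f"Newest house built in: {newest_build}")
print(f"Latest recorded sale:  {latest_sale}")
print(f"Gap in years:          {gap}")

# Number of houses built per year, most recent years first.
builds_per_year = sample.YearBuilt.value_counts().sort_index(ascending=False)
print(builds_per_year)
```

If the latest sale in the full dataset is also many years in the past, that would point toward explanation #2, since no sales were recorded after that year. If sales continue well past the newest build year, explanation #1 becomes more plausible. Running the same lines on `df` instead of `sample` shows which pattern the real data follows.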