# Our first ML model

As we saw in the previous notebook, there are about 80 different variables (features) we can use to train our model. Which ones should we use? There are **feature selection** techniques that help with this choice. For now, let's rely on our intuition and start with a small selection of features that we think will help the model learn.

In [2]:
# Don't modify

import pandas as pd

df = pd.read_csv('../data/housing/train.csv')

Use the attribute `columns` in the DataFrame to print the list of available features:

In [6]:
columns = ", ".join(df.columns) # your answer here
print(f'Available features: {columns}')

Available features: Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice


## Selecting features
The columns that are fed into our model (and later used to make predictions) are called **features**. In our case, those are the columns used to determine the home price. Sometimes you will use all columns except the target as features. Other times you'll be better off with fewer features.

We can start simple by selecting a small subset of the features above. A subset that could make sense is:

In [13]:
features = [
    'LotArea',
    'YearBuilt',
    '1stFlrSF',
    '2ndFlrSF',
    'FullBath',
    'BedroomAbvGr',
    'TotRmsAbvGrd'
]

These features are, respectively:

* Lot size in square feet
* The year when the house was built
* First floor square feet
* Second floor square feet
* Full bathrooms above grade
* Bedrooms above grade (does NOT include basement bedrooms)
* Total rooms above grade (does not include bathrooms)

These features should have enough **predictive power** for the final price, which is what we want to predict and what we call the **target** variable. Looking at all the columns, which one do you think is our target?

In [10]:
target = 'SalePrice' # your answer here

Create two new variables: 

* `X`: A DataFrame with _only_ the subset of selected columns
* `y`: A Series with _only_ the target variable

In [15]:
X = df[features] # your answer here
y = df[target] # your answer here
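As a small sketch of why the two selections behave differently: indexing a DataFrame with a list of column names returns a DataFrame, while indexing with a single column name returns a Series. The toy DataFrame and the names `toy_df`, `toy_features`, `toy_target`, `X_toy` and `y_toy` are hypothetical and used only for illustration.

```python
import pandas as pd

# Three-row toy stand-in for the housing data (illustrative only).
toy_df = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'YearBuilt': [2003, 1976, 2001],
    'SalePrice': [208500, 181500, 223500],
})

toy_features = ['LotArea', 'YearBuilt']
toy_target = 'SalePrice'

X_toy = toy_df[toy_features]  # list of column names -> DataFrame
y_toy = toy_df[toy_target]    # single column name -> Series

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```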

**Working with missing data**

For simplicity, we have selected a list of columns that do **not** have any missing values.

However, we saw before that some columns had missing values (`NaN`s, values that were not recorded). Most algorithms don't know how to handle a `NaN`, so we need to do something with those samples. There are several options:

1. Replace the missing value with a statistic computed over the rest of the samples (mean, median, min, etc.)
2. Try to infer the value from other columns (if it is a derived value)
3. Drop samples with `NaN` values.

Depending on the feature and the number of missing values, one option may be better than another. For now, though, we don't have to worry about this.
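For reference, here is a minimal sketch of options 1 and 3 using pandas. The toy DataFrame and its values are made up for illustration (the missing entry is an assumption, not taken from the housing data), and option 2 is left out because it depends on the specific columns involved.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not the real housing data).
toy = pd.DataFrame({
    'LotFrontage': [65.0, 80.0, np.nan, 60.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count the missing values in each column.
print(toy.isna().sum())

# Option 1: replace missing values with a statistic of the column (here, the mean).
filled = toy.fillna({'LotFrontage': toy['LotFrontage'].mean()})

# Option 3: drop every row that contains at least one NaN.
dropped = toy.dropna()

print(filled)
print(dropped)
```

Both calls return new DataFrames and leave `toy` unchanged, which makes it easy to compare the strategies side by side.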

**Review the data**

Before continuing further, review `X` to make sure it makes sense.

Print the descriptive statistics of `X`:

In [18]:
X.describe() # your answer here

|  | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 10516.828082 | 1971.267808 | 1162.626712 | 346.992466 | 1.565068 | 2.866438 | 6.517808 |
| std | 9981.264932 | 30.202904 | 386.587738 | 436.528436 | 0.550916 | 0.815778 | 1.625393 |
| min | 1300.0 | 1872.0 | 334.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 25% | 7553.5 | 1954.0 | 882.0 | 0.0 | 1.0 | 2.0 | 5.0 |
| 50% | 9478.5 | 1973.0 | 1087.0 | 0.0 | 2.0 | 3.0 | 6.0 |
| 75% | 11601.5 | 2000.0 | 1391.25 | 728.0 | 2.0 | 3.0 | 7.0 |
| max | 215245.0 | 2010.0 | 4692.0 | 2065.0 | 3.0 | 8.0 | 14.0 |


Print the top 5 rows:

In [19]:
X.head()

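|  | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |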


## Building our model
You will use the `scikit-learn` library to create your models. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

* **Define**: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
* **Fit**: Capture patterns from provided data. This is the heart of modeling.
* **Predict**: Just what it sounds like: use the fitted model to produce predictions.
* **Evaluate**: Determine how accurate the model's predictions are.
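The four steps above can be sketched on a tiny made-up dataset. The data and the names `X_demo`, `y_demo` and `demo_model` are hypothetical, and mean absolute error is just one possible evaluation metric, chosen here as an assumption. Evaluating on the same rows used for fitting is done only to keep the sketch short.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data: one feature (an area) and a price that grows with it.
X_demo = np.array([[50], [80], [120], [200]])
y_demo = np.array([100_000, 150_000, 210_000, 330_000])

# Define: choose the model type and its parameters.
demo_model = DecisionTreeRegressor(random_state=0)

# Fit: capture patterns from the provided data.
demo_model.fit(X_demo, y_demo)

# Predict: ask the fitted model for estimates.
demo_predictions = demo_model.predict(X_demo)

# Evaluate: compare the predictions with the known values.
demo_mae = mean_absolute_error(y_demo, demo_predictions)
print(demo_predictions, demo_mae)
```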

Let's try this with our subset of the data!

First of all, we need to import the class `DecisionTreeRegressor` from the `sklearn.tree` module:

In [20]:
from sklearn.tree import DecisionTreeRegressor # your answer here

Then, we instantiate the model and fit it. In order to train, we need to provide the model with two arguments: the samples `X` and their true values `y`.

In [21]:
model = DecisionTreeRegressor() # your answer here

model.fit(X, y) # fit the model

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

What do you think all those parameters are?

Many machine learning models allow some randomness in model training. Specifying a number for `random_state` ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
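A minimal sketch of this effect, using made-up data: two trees built with the same `random_state` should produce identical predictions. The choice of `max_features=2` is an assumption made here so that each split considers a random subset of features, which makes the role of the seed more visible.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data, used only to show the effect of random_state.
rng = np.random.default_rng(0)
X_rand = rng.normal(size=(200, 5))
y_rand = X_rand[:, 0] + rng.normal(scale=0.5, size=200)
X_new = rng.normal(size=(20, 5))

# Same data, same parameters, same seed for both models.
model_a = DecisionTreeRegressor(max_features=2, random_state=42).fit(X_rand, y_rand)
model_b = DecisionTreeRegressor(max_features=2, random_state=42).fit(X_rand, y_rand)

same = np.array_equal(model_a.predict(X_new), model_b.predict(X_new))
print(f'Same predictions with the same random_state: {same}')
```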

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the `predict` method works.

In [26]:
first_5 = X.head()
predictions = model.predict(first_5) # your answer here
first_5

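|  | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
|---|---|---|---|---|---|---|---|
| 0 | 8450 | 2003 | 856 | 854 | 2 | 3 | 8 |
| 1 | 9600 | 1976 | 1262 | 0 | 2 | 3 | 6 |
| 2 | 11250 | 2001 | 920 | 866 | 2 | 3 | 6 |
| 3 | 9550 | 1915 | 961 | 756 | 1 | 3 | 7 |
| 4 | 14260 | 2000 | 1145 | 1053 | 2 | 4 | 9 |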


In [27]:
print(f'Predictions are {predictions}')

Predictions are [208500. 181500. 223500. 140000. 250000.]


Use the `head` method to compare your predictions with the first 5 true values in `y`. Anything surprising?

In [28]:
# Print the first 5 true values
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
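
The predictions match the true values exactly. A likely explanation is that the model was asked about rows it had already seen during `fit`, and a decision tree with no depth limit can memorize them. The sketch below probes this on synthetic data (not the housing data) by comparing the error on rows used for fitting with the error on held-out rows. The split via `train_test_split` and the use of mean absolute error are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = 3 * X_syn[:, 0] + rng.normal(scale=2.0, size=300)

# Hold out a quarter of the rows so the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_mae = mean_absolute_error(y_train, tree.predict(X_train))
test_mae = mean_absolute_error(y_test, tree.predict(X_test))
print(f'MAE on rows seen during fit: {train_mae:.3f}')
print(f'MAE on held-out rows: {test_mae:.3f}')
```

An error close to zero on the fitted rows together with a clearly larger error on the held-out rows suggests that perfect predictions on training data say little about how the model will do on new houses.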