
# Selecting Data for Modeling
Your dataset has too many variables to wrap your head around, or even to print out neatly. How can you pare this overwhelming amount of data down to something you can understand?

We'll start by picking a few variables using our intuition. Later courses will show you statistical techniques to automatically prioritize variables.

To choose variables/columns, we'll need to see a list of all columns in the dataset. That is done with the **columns** attribute of the DataFrame (the last line of the code below).


In [1]:
import pandas as pd

melbourne_file_path = '/content/sample_data/california_housing_train.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')

In [2]:
melbourne_data

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16995 | -124.26 | 40.58 | 52.0 | 2217.0 | 394.0 | 907.0 | 369.0 | 2.3571 | 111400.0 |
| 16996 | -124.27 | 40.69 | 36.0 | 2349.0 | 528.0 | 1194.0 | 465.0 | 2.5179 | 79000.0 |
| 16997 | -124.30 | 41.84 | 17.0 | 2677.0 | 531.0 | 1244.0 | 456.0 | 3.0313 | 103600.0 |
| 16998 | -124.30 | 41.80 | 19.0 | 2672.0 | 552.0 | 1298.0 | 478.0 | 1.9797 | 85800.0 |


In [3]:
melbourne_data.shape

(17000, 9)

In [4]:
# Housing datasets often have missing values (houses for which some variables weren't recorded).
# We'll learn to handle missing values in a later tutorial.
# For now we take the simplest option and drop every row that contains a missing value.
# In this dataset the shape is the same before and after, so no rows are actually dropped.

# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)

In [5]:
melbourne_data.shape

(17000, 9)
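
Before dropping rows, it can help to count how many values are actually missing. The sketch below uses a small made-up DataFrame (`toy`, not the housing data) to show `isna().sum()` alongside `dropna`. The column values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the housing dataset).
toy = pd.DataFrame({
    "total_rooms": [5612.0, np.nan, 720.0, 1501.0],
    "population": [1015.0, 1129.0, np.nan, 515.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops every row that contains at least one missing value.
toy_clean = toy.dropna(axis=0)
print(toy.shape, "->", toy_clean.shape)
```

If the shape is unchanged after `dropna`, as in the cells above, no row contained a missing value.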

There are many ways to select a subset of your data. The [Pandas course](https://www.kaggle.com/learn/pandas) covers these in more depth, but we will focus on two approaches for now.

1. Dot notation, which we use to select the "prediction target"
2. Selecting with a column list, which we use to select the "features" 

## Selecting The Prediction Target 
You can pull out a variable with **dot notation**. This single column is stored in a **Series**, which is broadly like a DataFrame with only a single column of data.

We'll use dot notation to select the column we want to predict, which is called the **prediction target**. By convention, the prediction target is called **y**. So the code we need to save the median house values from our data is:

In [6]:
y = melbourne_data.median_house_value

In [7]:
y

0         66900.0
1         80100.0
2         85700.0
3         73400.0
4         65500.0
           ...   
16995    111400.0
16996     79000.0
16997    103600.0
16998     85800.0
16999     94600.0
Name: median_house_value, Length: 17000, dtype: float64

In [8]:
type(y)

pandas.core.series.Series

In [9]:
type(melbourne_data)

pandas.core.frame.DataFrame
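
Dot notation is shorthand for bracket selection, and it only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute. Here is a minimal sketch on made-up data (the `toy` frame and its `count` column are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "median_house_value": [66900.0, 80100.0, 85700.0],
    "count": [3, 1, 2],  # clashes with the DataFrame.count method
})

# Dot notation and bracket notation return the same Series.
y_dot = toy.median_house_value
y_bracket = toy["median_house_value"]
print(y_dot.equals(y_bracket))

# `toy.count` is the built-in method, not the column, so use brackets here.
count_column = toy["count"]
print(type(count_column).__name__, callable(toy.count))
```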

## Choosing "Features"
The columns that are fed into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to predict the median house value. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

In [10]:
melbourne_data.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value'],
      dtype='object')

In [11]:
melbourne_features = ['total_rooms', 'total_bedrooms', 'population', 'latitude', 'longitude']

By convention, this data is called **X**.

In [12]:
X = melbourne_data[melbourne_features]
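
Selecting with a list returns a DataFrame, even when the list holds a single name, whereas a bare string returns a Series. The sketch below illustrates this on a small made-up frame (`toy` and its values are assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [5612.0, 7650.0, 720.0],
    "population": [1015.0, 1129.0, 333.0],
    "median_house_value": [66900.0, 80100.0, 85700.0],
})

feature_names = ["total_rooms", "population"]

X_toy = toy[feature_names]           # list of names -> DataFrame
one_col_frame = toy[["population"]]  # one-item list -> still a DataFrame
one_col_series = toy["population"]   # bare string -> Series

print(type(X_toy).__name__, X_toy.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(type(one_col_series).__name__, one_col_series.shape)
```

scikit-learn estimators expect a two-dimensional feature input, which is one reason the features are selected with a list.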

Let's quickly review the data we'll be using to predict house prices. The `describe` method summarizes each column, and the `head` method shows the top few rows.

In [13]:
X.describe()

| | total_rooms | total_bedrooms | population | latitude | longitude |
|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | 2643.664412 | 539.410824 | 1429.573941 | 35.625225 | -119.562108 |
| std | 2179.947071 | 421.499452 | 1147.852959 | 2.13734 | 2.005166 |
| min | 2.0 | 1.0 | 3.0 | 32.54 | -124.35 |
| 25% | 1462.0 | 297.0 | 790.0 | 33.93 | -121.79 |
| 50% | 2127.0 | 434.0 | 1167.0 | 34.25 | -118.49 |
| 75% | 3151.25 | 648.25 | 1721.0 | 37.72 | -118.0 |
| max | 37937.0 | 6445.0 | 35682.0 | 41.95 | -114.31 |


In [14]:
X.head()

| | total_rooms | total_bedrooms | population | latitude | longitude |
|---|---|---|---|---|---|
| 0 | 5612.0 | 1283.0 | 1015.0 | 34.19 | -114.31 |
| 1 | 7650.0 | 1901.0 | 1129.0 | 34.4 | -114.47 |
| 2 | 720.0 | 174.0 | 333.0 | 33.69 | -114.56 |
| 3 | 1501.0 | 337.0 | 515.0 | 33.64 | -114.57 |
| 4 | 1454.0 | 326.0 | 624.0 | 33.57 | -114.57 |


Visually checking your data with these commands is an important part of a data scientist's job.  You'll frequently find surprises in the dataset that deserve further inspection.

---
# Building Your Model

You will use the **scikit-learn** library to create your models. In code, the library is imported as **sklearn**, as you will see in the sample code. Scikit-learn is one of the most widely used libraries for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:
* **Define:** What type of model will it be? A decision tree? Some other type of model? Other parameters of the model are specified here too.
* **Fit:** Capture patterns from provided data. This is the heart of modeling.
* **Predict:** Use the fitted model to generate predictions for a set of feature rows.
* **Evaluate:** Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

In [15]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

Many machine learning models allow some randomness in model training. Specifying a number for `random_state` ensures you get the same results in each run. This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly which value you choose.
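
The effect of `random_state` can be checked directly. The sketch below fits the same kind of tree twice with one seed and once with another, on synthetic data. The data, the helper `fit_and_predict`, and the `max_features=1` setting are assumptions made only so that the randomness becomes visible; they are not part of the tutorial's model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)
X_new = rng.normal(size=(50, 3))

def fit_and_predict(seed):
    """Fit a tree with the given seed and predict on rows it has not seen."""
    # max_features=1 makes each split consider one randomly chosen feature,
    # which is set here only to make the effect of the seed visible.
    model = DecisionTreeRegressor(max_features=1, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_new)

# The same seed reproduces the same model, and so the same predictions.
same_seed_match = np.array_equal(fit_and_predict(1), fit_and_predict(1))
print("Same seed, identical predictions:", same_seed_match)

# A different seed is free to build a different tree.
other_seed_match = np.array_equal(fit_and_predict(1), fit_and_predict(2))
print("Different seeds, identical predictions:", other_seed_match)
```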

We now have a fitted model that we can use to make predictions.

In practice, you'll want to make predictions for new houses coming on the market rather than the houses we already have prices for. But we'll make predictions for the first few rows of the training data to see how the `predict` method works.


In [16]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   total_rooms  total_bedrooms  population  latitude  longitude
0       5612.0          1283.0      1015.0     34.19    -114.31
1       7650.0          1901.0      1129.0     34.40    -114.47
2        720.0           174.0       333.0     33.69    -114.56
3       1501.0           337.0       515.0     33.64    -114.57
4       1454.0           326.0       624.0     33.57    -114.57
The predictions are
[66900. 80100. 85700. 73400. 65500.]


# Your Turn
Try it out yourself in the **[Model Building Exercise](https://www.kaggle.com/kernels/fork/1404276)**

---





In [17]:
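from sklearn.metrics import mean_absolute_error

# Predict on the same rows the model was trained on (in-sample),
# then measure the mean absolute error against the true values.
predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)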

0.0
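
An MAE of exactly 0.0 here reflects an in-sample score, not a perfect model: an unconstrained decision tree can keep splitting until it reproduces the training targets. MAE is the average of |actual - predicted| over all rows. The sketch below, on synthetic data (all names and values are assumptions), contrasts the score on training rows with the score on rows the model has not seen:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_seen = rng.normal(size=(300, 2))
y_seen = X_seen[:, 0] + rng.normal(scale=0.5, size=300)
X_unseen = rng.normal(size=(100, 2))
y_unseen = X_unseen[:, 0] + rng.normal(scale=0.5, size=100)

tree = DecisionTreeRegressor(random_state=1).fit(X_seen, y_seen)

# MAE computed by hand: the mean of the absolute errors.
manual_mae = np.mean(np.abs(y_seen - tree.predict(X_seen)))
in_sample_mae = mean_absolute_error(y_seen, tree.predict(X_seen))
out_of_sample_mae = mean_absolute_error(y_unseen, tree.predict(X_unseen))

print("in-sample MAE:", in_sample_mae)
print("out-of-sample MAE:", out_of_sample_mae)
```

The out-of-sample number is the one that indicates how the model would do on new houses, which is why a held-out validation set is useful.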

In [18]:
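from sklearn.model_selection import train_test_split

# Split features and target into training and validation sets.
# random_state fixes the shuffle so the split is the same on every run.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Fit on the training rows only. No random_state is set on the model here,
# so the score below can vary slightly between runs.
melbourne_model = DecisionTreeRegressor()
melbourne_model.fit(train_X, train_y)

# Evaluate on validation rows the model has not seen.
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))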

42758.021411764705
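
When neither `test_size` nor `train_size` is given, `train_test_split` shuffles the rows and holds out 25% of them for validation. The sketch below checks this on a made-up 100-row frame (the `demo_` names and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

demo_X = pd.DataFrame({"feature": np.arange(100, dtype=float)})
demo_y = pd.Series(np.arange(100, dtype=float) * 10.0, name="target")

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=0
)

print(len(demo_train_X), len(demo_val_X))

# Features and targets stay aligned: both halves share the same row labels.
print(demo_val_X.index.equals(demo_val_y.index))

# No row appears in both the training and validation sets.
print(set(demo_train_X.index).isdisjoint(demo_val_X.index))
```

With the 17,000 rows used above, that default leaves 12,750 rows for training and 4,250 for validation.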


In [19]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree capped at max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [20]:
# Compare validation MAE across several tree sizes.
for max_leaf_nodes in [100, 500, 700, 1000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print('Max leaf nodes: %d  \t\t MAE: %d' % (max_leaf_nodes, my_mae))

Max leaf nodes: 100  		 MAE: 49365
Max leaf nodes: 500  		 MAE: 43379
Max leaf nodes: 700  		 MAE: 42634
Max leaf nodes: 1000  		 MAE: 41912
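
The loop above prints one score per candidate. To pick the best size programmatically, the scores can be collected in a dict and the key with the lowest MAE selected. Here is a sketch on synthetic data, with a local copy of the helper so the block runs on its own (the data and the candidate sizes are assumptions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Return validation MAE for a tree limited to max_leaf_nodes leaves."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=500)
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

candidates = [5, 25, 50, 250]
scores = {n: get_mae(n, tr_X, va_X, tr_y, va_y) for n in candidates}

# Key with the smallest validation MAE.
best_size = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes:", best_size)
```

In the output above, the MAE is still falling at 1000 leaf nodes, so larger candidates may be worth trying as well.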


In [21]:
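from sklearn.ensemble import RandomForestRegressor

# A random forest averages many decision trees; here it uses 1000 of them.
# No random_state is set, so the score below can vary slightly between runs.
forest_model = RandomForestRegressor(n_estimators=1000)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))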

32509.20510705882
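
A random forest regressor predicts by averaging the predictions of its individual trees, which tends to smooth out the errors of any single tree. The sketch below checks the averaging on synthetic data (the data, `n_estimators=20`, and `random_state=0` are assumptions chosen to keep the example small and repeatable):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is available in the estimators_ attribute.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's prediction matches the mean of the per-tree predictions.
print(np.allclose(manual_average, forest.predict(X_demo)))
```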


In [26]:
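# Collect the validation-set predictions in a DataFrame.
output = pd.DataFrame({'SalePrice': melb_preds})

# Write them to a CSV file; index=True also writes the row index as the first column.
output.to_csv('submission.csv', index=True)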
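
Note that a DataFrame built from a plain array gets a fresh 0, 1, 2, ... index, so `index=True` writes those positions rather than the original row labels of the validation set. If the original labels matter, one option is to pass them explicitly. Here is a sketch with made-up values (`val_index`, `preds`, and the column name `predicted_value` are assumptions standing in for `val_X.index` and the model's predictions):

```python
import numpy as np
import pandas as pd

# Stand-ins for val_X.index and the model's predictions.
val_index = pd.Index([12, 7, 42], name="row")
preds = np.array([101000.0, 98000.5, 250000.0])

# Passing index= keeps the original row labels next to each prediction.
output_demo = pd.DataFrame({"predicted_value": preds}, index=val_index)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = output_demo.to_csv(index=True)
print(csv_text)
```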