# Introduction
**This will be your workspace for the [Machine Learning course](https://www.kaggle.com/learn/machine-learning).**

You will need to translate the concepts to work with the data in this notebook, the Iowa data. Each page in the Machine Learning course includes instructions for what code to write at that step in the course.

# Write Your Code Below

In [None]:
import pandas as pd

#Save the file path to a variable
main_file_path = '../input/house-prices-advanced-regression-techniques/train.csv' # this is the path to the Iowa data that you will use

# read the data and store data in DataFrame
data = pd.read_csv(main_file_path)

# Run this code block with the control-enter keys on your keyboard, or click the blue button on the left
print('Some output from running this cell')

In [None]:
# read the data and store data in DataFrame
iowa_data = pd.read_csv(main_file_path)

# Use the .describe() method to print the data summary
print(iowa_data.describe())

## Selecting & Filtering data

In [None]:
# Use the .columns attribute (not a method, so no parentheses) to view all column names
print(iowa_data.columns)

## Selecting Single Column

In [None]:
# store the series of prices separately
iowa_price = iowa_data.SalePrice

# the head command returns the top few lines of data.
print(iowa_price.head())

In [None]:
# Can use describe method on a single column as well
iowa_price.describe()

In [None]:
iowa_lotconfig = iowa_data.LotConfig

print(iowa_lotconfig.head())

In [None]:
iowa_lotconfig.describe()

In [None]:
iowa_garage = iowa_data.GarageCars

print(iowa_garage.head())

In [None]:
iowa_garage.describe()

## Selecting Multiple Columns
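
Indexing a DataFrame with a single column name returns a Series, while indexing with a list of names returns a new DataFrame. A minimal sketch of the difference, assuming a small made-up frame (`toy`) that only borrows column names from the Iowa data:

```python
import pandas as pd

# Made-up stand-in for the Iowa data; only the column names mirror the real file.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "YearBuilt": [2003, 1976, 2001],
    "GarageCars": [2, 2, 3],
})

single = toy["GarageCars"]               # one name -> Series
several = toy[["YearBuilt", "LotArea"]]  # list of names -> DataFrame

print(type(single).__name__)   # Series
print(type(several).__name__)  # DataFrame
print(list(several.columns))   # ['YearBuilt', 'LotArea']
```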

In [None]:
# Use two or more variables from the columns above and save them to a new DataFrame
iowa_multiple = ['YearBuilt','LotArea','GarageCars']

In [None]:
two_columns = iowa_data[iowa_multiple]

In [None]:
# Using the describe command view the summary of the data
two_columns.describe()

In [None]:
next_columns = ['SalePrice','GarageCars','YearBuilt','LotArea']

In [None]:
next_columns1 = iowa_data[next_columns]

In [None]:
next_columns1.describe()

# Building the first model:

### Choosing the Prediction target

In [None]:
#Select the target variable, which corresponds to the sales price. 
#Looking at previous commands may help you remember what this column is called. Save this to a new variable called y.

y = iowa_data.SalePrice

## Choosing Predictors

In [None]:
#Create a list of the names of the predictors we will use in the initial model. 
#Use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

iowa_predictors = ['LotArea','YearBuilt','1stFlrSF','2ndFlrSF','FullBath','BedroomAbvGr','TotRmsAbvGrd']

In [None]:
#Using the list of variable names you just created, select a new DataFrame of the predictors data. Save this with the variable name x.

x = iowa_data[iowa_predictors]

## Building the Model

In [None]:
#Create a DecisionTreeRegressor model and save it to a variable (with a name like my_model or iowa_model). 
#Ensure you've done the relevant import so you can run this command.

from sklearn.tree import DecisionTreeRegressor

# Define model
iowa_model = DecisionTreeRegressor()

In [None]:
#Fit the model you have created using the data in x and the target data (y) you saved above.
# Fit model
iowa_model.fit(x,y)

In [None]:
# Make a few predictions with the model's predict command and print out the predictions.


print('Making Predictions for the following five homes:')
print(x.head())

In [None]:
print('The predictions are:')
print(iowa_model.predict(x.head()))

In [None]:
print('The predictions for the whole dataset are: ')
print(iowa_model.predict(x))

In [None]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = iowa_model.predict(x)
mean_absolute_error(y,predicted_home_prices)

## Model Validation 

### Using In-Sample Scores

In [None]:
from sklearn.tree import DecisionTreeRegressor

# Define model
iowa_model = DecisionTreeRegressor()

# Fit model
iowa_model.fit(x,y)

Then we calculate the mean absolute error (MAE).
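
MAE is the average of the absolute differences between the actual and predicted values. A minimal sketch of that arithmetic by hand, using made-up prices rather than the Iowa data:

```python
# Made-up sale prices and predictions, purely for illustration.
actual = [200000, 150000, 310000, 120000]
predicted = [210000, 140000, 300000, 125000]

# MAE: average of the absolute differences between actual and predicted values.
errors = [abs(a - p) for a, p in zip(actual, predicted)]
mae = sum(errors) / len(errors)

print(errors)  # [10000, 10000, 10000, 5000]
print(mae)     # 8750.0
```

An MAE of 8750 here means the predictions are off by $8,750 on average.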

In [None]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = iowa_model.predict(x)
mean_absolute_error(y, predicted_home_prices)

The MAE above is an in-sample score: it is calculated on the same data that was used to fit the model.

This can be misleading, since we used the same data both to build the model and to measure its performance.

To solve this issue, we can use the train_test_split function to split the data into two parts: one for training and one for validation.
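
To see how optimistic an in-sample score can be, here is a rough sketch on synthetic data (not the Iowa data). The names `X_demo`, `y_demo`, `X_fit` and `X_held` are made up for this illustration. An unrestricted tree can memorize the rows it was fitted on, so its in-sample MAE is typically close to zero while its error on held-out rows is not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the Iowa data): a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, size=400)

# Hold out part of the rows so the model never sees them during fitting.
X_fit, X_held, y_fit, y_held = train_test_split(X_demo, y_demo, random_state=0)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_fit, y_fit)

in_sample_mae = mean_absolute_error(y_fit, tree.predict(X_fit))
held_out_mae = mean_absolute_error(y_held, tree.predict(X_held))

print("In-sample MAE:", in_sample_mae)
print("Held-out MAE: ", held_out_mae)
```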

### July/04/2018

## Model Validation 

In [None]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both predictors and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.

train_x, val_x, train_y, val_y = train_test_split(x,y,random_state = 0)

# Define model
iowa_model = DecisionTreeRegressor()

# fit model
iowa_model.fit(train_x,train_y)


# get predicted prices on validation data
val_predictions = iowa_model.predict(val_x)
print(mean_absolute_error(val_y,val_predictions))
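
As a quick check of the random_state comment above, the following sketch splits a made-up array of 20 row labels twice with the same seed. It also shows the split sizes when test_size is left unset, in which case scikit-learn holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twenty made-up row labels, purely for illustration.
rows = np.arange(20)

first_train, first_val = train_test_split(rows, random_state=0)
second_train, second_val = train_test_split(rows, random_state=0)

# Same random_state -> the same rows land in the validation set both times.
print(np.array_equal(first_val, second_val))  # True

# With no test_size given, 25% of the rows are held out.
print(len(first_train), len(first_val))  # 15 5
```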


## Experiment with Different Models

### Overfitting vs Underfitting

In [None]:
# Overfitting - creating a deep tree with too many leaves

# Underfitting - creating a shallow tree with too few branches and leaves

''' With overfitting, we end up with many leaves that each hold very little data; with underfitting, we end up with few leaves that each hold too much data. 
    Neither case validates well or makes a good model. '''

### EX: 

#### max_leaf_nodes

#### Using a utility function to compare MAE scores for different values of max_leaf_nodes

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, predictors_train, predictors_val, targ_train, targ_val):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(predictors_train, targ_train)
    preds_val = model.predict(predictors_val)
    mae = mean_absolute_error(targ_val, preds_val)
    return(mae)

In [None]:
#The data is loaded into train_x, val_x, train_y and val_y using the code you've already seen (and which you've already written).

In [None]:
#### Now we can use a for loop to compare the accuracy of models built with different values of max_leaf_nodes

In [None]:
# compare MAE with differing values of max_leaf_nodes

for max_leaf_nodes in [5,50,500,5000]:
    my_mae = get_mae(max_leaf_nodes, train_x, val_x, train_y, val_y)
    print('Max Leaf Nodes: %d \t\t Mean Absolute Error: %d' %(max_leaf_nodes,my_mae))

#### What is the Optimal number of leaves? 
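
One way to answer this is to pick the candidate with the lowest validation MAE. The sketch below does that on synthetic data so it runs on its own. The helper `validation_mae` and the `*_demo`, `*_fit` and `*_held` names are hypothetical, and the winning value on the Iowa data may well differ.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Iowa data): a noisy nonlinear signal.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 2))
y_demo = np.sin(X_demo[:, 0]) * 10 + X_demo[:, 1] + rng.normal(0, 1.0, size=500)
X_fit, X_held, y_fit, y_held = train_test_split(X_demo, y_demo, random_state=0)

def validation_mae(max_leaf_nodes):
    """Fit a tree capped at max_leaf_nodes and return its MAE on held-out rows."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_fit, y_fit)
    return mean_absolute_error(y_held, model.predict(X_held))

candidates = [5, 50, 500, 5000]
scores = {n: validation_mae(n) for n in candidates}

# The candidate with the lowest validation MAE is the best of those tried.
best_size = min(scores, key=scores.get)
print(scores)
print("Best max_leaf_nodes among candidates:", best_size)
```

This only finds the best of the values tried. A finer grid around the winner could narrow it down further.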

### Conclusion

Overfitting - captures spurious patterns that won't recur, leading to less accurate predictions 

Underfitting - fails to capture relevant patterns, again leading to less accurate predictions 

Validation data, which is not used to train the model, can be used to measure the model's accuracy 

# Random Forests

#### Random forests are built the same way as decision trees in scikit-learn 

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor()
forest_model.fit(train_x,train_y)
iowa_preds = forest_model.predict(val_x)
print(mean_absolute_error(val_y,iowa_preds))
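
Because both estimators share the same fit and predict interface, they can be swapped in a loop. A rough sketch on synthetic data (the `*_demo`, `*_fit` and `*_held` names are made up here, and the numbers say nothing about the Iowa data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (not the Iowa data): a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 2.0, size=400)
X_fit, X_held, y_fit, y_held = train_test_split(X_demo, y_demo, random_state=0)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(random_state=0),
}

# Identical calls work for both estimators.
results = {}
for name, model in models.items():
    model.fit(X_fit, y_fit)
    results[name] = mean_absolute_error(y_held, model.predict(X_held))
    print(name, results[name])
```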

## Submitting from a kernel

#### EX: 

In [None]:
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestRegressor

# Read the data
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')


# pull data into target (y) and predictors (X)
train_y = train.SalePrice

predictor_columns = ['LotArea', 'OverallQual', 'YearBuilt', 'TotRmsAbvGrd']

# Create training predictors data
train_x = train[predictor_columns]

my_model = RandomForestRegressor()
my_model.fit(train_x,train_y)

#### Preparing Test Data

In [None]:
# Read the test data
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

# Treat the test data in the same way as training data. In this case, pull same columns.
test_x = test[predictor_columns]

# Use the model to make predictions
predicted_prices = my_model.predict(test_x)

# We will look at the predicted prices to ensure we have something sensible.
print(predicted_prices)

## Prepare Submission File

- Submissions are made as CSV files
- They usually have two columns: an id column and a prediction column
- The id column comes from the test data, and the prediction column uses the name of the target field

- Then we create a DataFrame with this data
- Then we use the DataFrame's to_csv method to write the submission file 
- We need to pass index=False; this prevents pandas from adding another column to our CSV file 
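
To make the index=False point concrete, this sketch writes a tiny made-up frame to in-memory buffers instead of real files and compares the header lines. The ids and prices are invented for illustration.

```python
import io
import pandas as pd

# Made-up ids and predictions, purely for illustration.
demo = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, 155000.0]})

with_index = io.StringIO()
demo.to_csv(with_index)                  # default: index written as an unnamed first column
without_index = io.StringIO()
demo.to_csv(without_index, index=False)  # only the two intended columns

print(with_index.getvalue().splitlines()[0])     # ,Id,SalePrice
print(without_index.getvalue().splitlines()[0])  # Id,SalePrice
```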


In [None]:
my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})

# you could use any filename. We use 'submission' here
my_submission.to_csv('submission.csv',index=False)

## Make Submission 

- Hit the PUBLISH button on the screen
- It will run the kernel
- Then we will have a tab for OUTPUT
- (This only shows up when we prepare an output file, as we did in the Prepare Submission File section above)

## Last Step

- Click the OUTPUT button 
- Then you will be prompted with the SUBMIT COMPETITION screen
- Then you will see the performance of the model
- You can always go back and edit the file


**If you have any questions or hit any problems, come to the [Learn Discussion](https://www.kaggle.com/learn-forum) for help.**

**Return to [ML Course Index](https://www.kaggle.com/learn/machine-learning)**